Final Evaluation of Mips M/500

Abstract: In response to a request from the DoD, an analysis of a Reduced Instruction Set
Computer (RISC) processor, the Mips M/500, was performed. All aspects of processor
capabilities and support software were evaluated, tested, and compared to familiar Complex
Instruction Set Computer (CISC) architectures. In all cases, the RISC computer and
its support software performed better than a comparable CISC computer. This report
provides the general and specific results of these analyses, along with the recommendation
that the DoD and other government agencies seriously consider this or other RISC
architectures as a highly viable and attractive alternative to the more familiar but less
efficient CISC architectures.


1.  Introduction 

This report describes our evaluation of the Mips M/500 RISC processor¹ as part of our ongoing
research into RISC class architectures. Our intention was to review the general class of RISC
architectures using the Mips M/500 as an example of this type of machine, rather than to specifically
evaluate the Mips M/500. Although it is difficult to generalize about the behavior of all RISC
processors from the performance of a single example, we have tried to point out the strengths and
weaknesses of the Mips M/500 in relation to other architectures, and we have tried to demonstrate
how the shortcomings and positive aspects of the Mips M/500 can be extrapolated to other RISC
class machines.

This report covers our findings, offering insights into the strong and weak points of the Mips M/500,
often by comparing it to the VAX (for both hardware evaluation and compiler evaluation purposes). In
analyzing the shortcomings of the Mips M/500, we try to offer possible solutions and present
comparisons to the general RISC class of architectures.

I  entered  into  the  RISC  assessment  project  with  a  strong  bias  toward  CISC  architectures.  I  looked  at 
this  project  as  an  interesting  exercise  in  which  I  would  have  my  suspicions  about  RISC  processors 
confirmed,  and  one  in  which  quite  consistently  I  would  find  vindication  for  the  CISC  side  in  the  great 
"RISC versus CISC" debate.

I  have,  however,  come  to  the  opposite  conclusion.  My  research  on  this  project  has  convinced  me 
(quite  consistently,  I  might  add)  that,  if  there  is  a  'right'  side  of  the  debate  to  be  on,  it  is  the  RISC 
side. In all features - execution speed, compiler efficiency, language consistency, and code size -
the  concept  of  a  reduced  instruction  set  computer  has  proven  to  be  the  correct  architectural  choice. 
The  term  reduced  has  in  no  way  implied  restricted,  nor  has  it  caused  the  horrific  increases  in  code 
size  that  CISC  proponents  tout  to  support  their  cause.  In  fact,  comparing  the  Mips  M/500  instruction 
set  usage  versus  the  Vax  instruction  set  usage,  we  found  that  the  instructions  used  by  the  Mips 
M/500 compilers closely paralleled those used by the Vax. The main deviation was in the area of
addressing modes, but, by and large, the Vax compilers poorly used the complex modes provided by
the VAX hardware.


¹ The Mips M/500 is produced by Mips, Incorporated, Sunnyvale, CA. It is one implementation of the R2000 processor
architecture. All references in this report to the Mips M/500 architecture refer to the R2000, while all performance statistics
refer to the Mips M/500. Mips, Incorporated also manufactures faster versions of the R2000 - the Mips M/800 and the Mips
M/1000. These processors were not evaluated for this report.


Daniel  V.  Klein 

Principal  Investigator 
2. Evaluation Methodology

Our studies concentrated on three areas:

1. instruction set conformance

2. benchmark performance

3. compiler and assembler effectiveness

The results obtained in these three areas of research are elaborated in their respective chapters.
Here we describe the methodology used to evaluate the Mips M/500.


2.1.  Compliance  with  DoD  CORE  MIPS  ISA 

Mips Incorporated had previously enrolled in the DoD CORE ISA standard.² In brief, this standard
allows a hardware manufacturer to specify its own RISC class architecture as long as the
architecture either conforms directly to the standard, or can provide assembly language translators from the
manufacturer ISA to the CORE ISA and the manufacturer's ISA. Technically speaking, even the Vax
satisfies these requirements, but the extra features provided by the Vax are considered detrimental
by the standard.

To determine whether the Mips M/500 satisfies the requirements of the CORE ISA, we evaluated the
architecture with these evaluation criteria:

1. Justification for extra instructions - Are the instructions that Mips added to the CORE
in their implementation reasonable? That is, can they be generated by a compiler, are
they needed to perform special operating system functions, or are they reasonable for
use in specialized high level applications?

2. Justification for removed instructions - Did Mips exercise reasonable judgement in
eliding instructions from the CORE in their implementation? That is, can the functions
of the missing instructions be carried out with other instructions or combinations of
instructions? Can an automatic translator perform this translation?

3. Number and classes of registers - Does the number of registers meet or exceed the
requirements of the CORE? Are the registers general, or are there special case
registers which must be used in special ways? How do special case registers affect
the overall design?

4. Is it RISC? This is a very difficult question to answer since we have not yet established
what "RISC" means. We did, however, attempt to classify the Mips M/500.


² CORE Set of Assembly Language Instructions for MIPS-Based Microprocessors, Version 3.2, January 1987; originally
written by Thomas Gross, Carnegie Mellon University; maintained by Robert Firth, Software Engineering Institute.
2.2. Benchmark Performance

We ran many standard and non-standard benchmarks on the Mips M/500 (and on the Vax for
comparison purposes). Some of the results are presented in chapter 4, although not all of our tests are
reported. We have not withheld any useful information; however, some of the benchmarks were
inconclusive or inapplicable. The benchmark suite consisted of:

1. BYTE Benchmarks - the benchmark suite from BYTE magazine, August 1983 and
August 1984.

2. Whetstones - the quintessential floating point benchmark (although we show how this
is an inadequate benchmark to use).

3. Dhrystones - an integer benchmark similar in functionality to the Whetstone benchmark.

4. EUUG Workstation Performance - a set of simple programs released by the European
Unix Users Group to test a computer's performance under varying loads.

5. LinPak - Jack Dongarra's matrix manipulation benchmarks, written at Argonne National
Laboratories.

6. Spice - a circuit simulator often used to measure processor efficiency. This benchmark
heavily loads the floating point hardware.

7. LLNL Loops - a set of Fortran kernels, released through Lawrence Livermore National
Laboratories, designed to exercise the floating point system.

8. Buchholz - an artificial benchmark designed at IBM to measure system load handling
capabilities.

9. Fortran FP - a simplistic benchmark for measuring the time to execute various
Fortran floating point operations. This was deemed too simplistic to report on.

10. FFT - a simple FFT algorithm, analyzed to compare compiler efficiency.

11. 20 Queens - an extension of the 8 queens placement problem, analyzed to compare
compiler efficiency.

12. Unix lex - a lexer generator for which a number of degenerate lexical specifications
exist. These specifications heavily load the lexer generator, and provide a reasonable
"natural" benchmark.

13. Ackermann's Function - a test devised to evaluate the efficiency of a compiler's recursive
code analysis and generation.


2.3.  Compiler  Performance 

The Mips compilers were carefully examined for a number of characteristics. Many of these
characteristics are specific to the Mips M/500, but some of our findings can easily be generalized to other
compilers. The areas of compiler performance that we examined are:

1. Compiler speed - that is, speed of compilation during the parsing, code generation,
and optimization phases, as well as time spent on assembly and assembly level code
reorganization. This phase of analysis looked at how long a user would have to wait
for a compilation to run, regardless of the optimality of the generated code.

2. Speed of generated code - that is, how fast the compiled code would run. This test
was varied over different levels of compiler optimization and was tested with many of
the benchmarks described above in section 2.2.

3.  Optimizer  efficiency  -  that  is,  how  good  is  the  code  that  is  generated  by  the  compiler 
and  assembler.  To  evaluate  this,  we  looked  at  four  aspects  of  optimization: 

a. Optimization techniques used - which methods are used, and which are not
used. The optimization techniques we looked for include: α-motion,
ρ-motion, ω-motion, routine hoisting, loop-invariant code motion, common
subexpression elimination, arithmetic expression reorganization, branch
optimization, multi-way branch evaluation, etc.

b. Register usage - how well the registers are allocated. We examined register
tracking, register re-use, and type of register use (i.e., addressing mode
interactions with register use), as well as register usage in parameter passing.

c. Instruction utilization - how well the instruction set of the native machine is
used, including evaluations on the efficiency of code idioms used, and on the
optimality of the generated code. We also examined the code that was
generated for algorithms that could be written in the three languages available on the
Mips M/500: Fortran, C, and Pascal.

d. Instruction coverage - how completely the instruction set of the native machine
is used, including percentages of used versus unused instructions.

These  topics  are  all  discussed  in  chapter  6. 

4.  Assembly  reorganization  and  pipelining  -  that  is,  how  well  the  assembler  reorganizer 
kept  the  pipeline  filled,  how  efficiently  nop  instructions  were  eliminated,  and  what  the 
reorganizer  was  able  to  accomplish  in  final  stage  peephole  optimization.  We  also 
looked  at  code  idioms  that  would  enhance  reorganization,  and  at  those  we  found  that 
hindered  reorganization.  This  topic  is  discussed  in  chapters  3  and  7,  and  again  in 
appendix  A. 


2.4.  Applicability 

We also examined the applicability of the Mips M/500 (and of RISC architectures in general) to
common environments. The problem areas that we considered were:

1. Usable in a workstation environment - a single user (or small number of users)
development station, either for general software, or software targeted for an embedded
application.

2. Usable in embedded applications - placing the Mips M/500 processor chip on board a
platform for real-time analysis and control.

3. Usable in included applications - using the Mips M/500 in a network (either as a
stand-alone chip or as a workstation) with other, potentially different, processors.
3. Analysis of MIPS Assembler Reorganizer

The Mips assembler reorganizer is the system program that takes Mips assembly language
instructions and translates them into the Mips M/500 native machine code. As one of its side functions, it
also reorganizes the machine code to eliminate the nop instructions that must follow instructions
such as branches and jumps. This reorganization takes advantage of the pipeline
nature of the Mips M/500 hardware. That is, once an instruction is loaded in the pipeline, it will be
executed. This unconditional execution occurs in spite of any jumps or branches that may be taken.


3.1.  Assembly  Reorganization 

As a simple example of assembly reorganization, consider the instruction sequence shown in figure
3-1. In this simple example, the numbers in registers $5 and $6 are subtracted and the result is
placed into register $4.³ If the result of the subtraction is non-zero, branch to the label foo; otherwise,
increment register $4 by 1 and continue.

        sub     $4,$5,$6
        bne     $4,$0,foo
        add     $4,1

Figure 3-1: Sample Assembler Input

The first and last instructions take one clock cycle each. However, the branch instruction takes two
clock cycles - one to determine whether the condition is true or not, and another to load the program
counter with the address of the new instruction if the condition is true.⁴ Thus, given the assembler
input in figure 3-1, the assembler would generate the machine language code seen in figure 3-2.

        sub     a0,a1,a2
        bne     a0,zero,foo
        nop
        addi    a0,1

Figure 3-2: Sample Machine Language Output

The assembler reorganizer has added a nop instruction following the conditional branch. Since the
cycle following the branch instruction can be filled with an instruction, the assembler reorganizer
must take care to ensure that the addi instruction⁵ is executed only if the branch is not taken. Since
the destination of the subtract instruction is the source operand of the comparison, the reorganizer
cannot perform any assembly reorganization. However, consider the sample assembler source in
figure 3-3.


³ These register names will be changed to the logical names a1, a2, and a0, respectively, by the disassembler. However,
the location of the registers is the same, regardless of their names. The list of register number to register name mappings is
found in table A-1 in appendix A.

⁴ A second cycle is expended whether or not the branch is taken. Additionally, it is worth mentioning that the new
program counter is loaded at the end of the second cycle, so that the whole cycle may be filled with an instruction execution.

⁵ Note the change of instruction name between the Mips assembler input and the Mips M/500 machine language output.
        sub     $4,$5,$6
        bne     $5,$0,foo
        add     $4,1

Figure 3-3: Sample Assembler Input


In  this  case,  we  have  changed  the  source  of  the  conditional  branch  to  register  $5,  which  is  not  a 
direct  result  of  the  subtract  instruction.  What  the  assembler  reorganizer  will  produce,  given  this 
input,  is  shown  in  figure  3-4. 

        bne     a1,zero,foo
        sub     a0,a1,a2
        addi    a0,1

Figure 3-4: Sample Reorganizer Output

Notice that the order of the Mips M/500 machine instructions is no longer the same as that of the
Mips assembly language input. In fact, it would appear that the subtraction occurs only after the
branch is rejected. This is not the case, however. Recall that the branch instruction always takes
two cycles to execute, and that the instruction following the branch is always executed regardless of
whether or not the branch is taken. Therefore, even though the instruction stream does not look
correct, it is correct. The subtract instruction is always executed, whether or not the branch is taken.
Thus, register $4 (that is, a0) will always have the correct result in it, and the addi instruction is only
executed if the branch is not taken.
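
The legality test behind figures 3-1 through 3-4 can be sketched in a few lines of Python. This is an illustration only, not the Mips reorganizer's algorithm: the function names and the dictionary of registers are assumptions made for the sketch. It treats the instruction after a branch as always executed, and shows that hoisting the branch above the subtraction preserves the result only when the branch does not read the subtraction's destination.

```python
def can_fill_delay_slot(prev_dest, branch_sources):
    """The instruction before a branch may move into the delay slot
    only if the branch does not read the register it writes."""
    return prev_dest not in branch_sources


def run_source_order(regs, branch_reg):
    """Source order: sub, then bne on branch_reg, then add on fall-through."""
    r = dict(regs)
    r["a0"] = r["a1"] - r["a2"]      # sub  $4,$5,$6
    taken = r[branch_reg] != 0       # bne  <branch_reg>,$0,foo
    if not taken:
        r["a0"] += 1                 # add  $4,1
    return r["a0"], taken


def run_reorganized(regs, branch_reg):
    """Reorganized order: bne first, sub placed in the delay slot."""
    r = dict(regs)
    taken = r[branch_reg] != 0       # branch condition read before the sub
    r["a0"] = r["a1"] - r["a2"]      # delay slot: executed either way
    if not taken:
        r["a0"] += 1                 # addi a0,1
    return r["a0"], taken


regs = {"a0": 9, "a1": 5, "a2": 5}

# Figure 3-3: the branch reads a1, so the sub may fill the delay slot.
same = run_source_order(regs, "a1") == run_reorganized(regs, "a1")

# Figure 3-1: the branch reads a0, so moving the sub changes the outcome.
differs = run_source_order(regs, "a0") != run_reorganized(regs, "a0")
```

With these starting values both `same` and `differs` are true, which matches the reorganizer's refusal to reorder figure 3-1 and its willingness to reorder figure 3-3.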

3.2.  Translation  of  Mips  Assembly  Instructions 

As mentioned earlier, the assembler reorganizer will change the name of an assembler instruction to
match the Mips M/500 native machine language. It is not the case, however, that every Mips
assembly instruction has a corresponding Mips M/500 native machine language instruction.
Sometimes (as is the case for conditional branches), the inverse condition is tested with reversed
arguments, at no extra instruction count expense. Often, however, multiple Mips M/500 native
instructions are substituted for a single Mips assembler instruction. Consider the example shown in figure
3-5. In this case, we have substituted the mulo (multiply with overflow) instruction for the sub
instruction.

        mulo    $4,$5,$6
        bge     $4,$0,foo
        add     $4,1

Figure 3-5: Sample Assembler Input

What happens in this case (as shown in figure 3-6) is that the assembler reorganizer translates the
single mulo instruction into a sequence of 8 Mips M/500 native machine language instructions. The
additional instructions are required to effect the overflow checking that the documentation for the
mulo instruction advertises (as being part of a single instruction). The net effect of this legerdemain
is that what appears to be a single instruction (taking a single machine cycle) is instead a sequence
of 8 instructions taking 24 cycles to execute.

        mult    a1,a2
        mflo    a0
        sra     a0,a0,31
        mfhi    at
        beq     a0,at,0x1c
        mflo    a0
        break   6
        nop
        bne     a0,zero,foo
        nop
        add     a0,1

Figure 3-6: Machine Language Output

Compilers (such as those for strongly typed languages like Ada) are not required to use the mulo
instruction and may implement their own overflow checking software (see section 3.2.1). In general,
however, the instruction counts obtained from the assembler output of compilers are not to be trusted
as a measure of execution cycles (see section 4.1 on Ackermann's function for a discussion of this
subject). Instead, the actual executable image must be examined to determine exactly what
instructions will be executed.⁶ A comprehensive table of all instruction translations (and
accompanying commentary) is in appendix A. The reader is strongly encouraged to read this appendix to
correctly understand the translation from the Mips high level instruction set to the Mips M/500 native
instruction set.
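
The overflow test carried out by the expansion in figure 3-6 can be modeled as follows. This Python sketch is an assumption-laden illustration rather than a description of the hardware: the helper names are hypothetical, operands are assumed to be signed 32-bit values, and the `break 6` trap is represented by a Python exception. The idea it shows is the one visible in the figure: the product fits in 32 bits exactly when the high word equals the sign extension of the low word.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mulo(a, b):
    """Model of the figure 3-6 sequence for signed 32-bit operands."""
    product = a * b                                  # mult: 64-bit HI:LO
    lo = product & MASK32                            # mflo
    hi = (product >> 32) & MASK32                    # mfhi
    sign_fill = MASK32 if lo & 0x80000000 else 0     # sra of the low word by 31
    if hi != sign_fill:                              # beq not taken: break 6
        raise OverflowError("32-bit signed multiply overflow")
    return to_signed32(lo)
```

For example, `mulo(6, 7)` returns 42, while `mulo(65536, 65536)` raises `OverflowError` because the high word is non-zero while the low word is zero.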

3.2.1.  Interesting  Effects  of  Multiplication 

The Mips instruction set provides a number of different multiply and divide instructions. Although
most instructions on the Mips M/500 take only a single cycle to execute, the multiply and divide
instructions take far longer. Thus, it is in the best interest of the execution speed for the
assembler/reorganizer to change multiply instructions into sequences of shifts and adds or subtracts.
The only time this is valid is when the value of one of the multiplicands is known (i.e., it is a constant
value). The assembler/reorganizer will substitute the appropriate sequence of simpler instructions
only when the execution time of a multiply exceeds that of a sequence of shifts and adds. When
neither of the multiplicands is a constant value, the assembler/reorganizer uses the appropriate Mips
M/500 multiplication instruction.⁷

In the worst case, the number of instructions that will be generated for a multiply is n-1 adds and n
shifts, where n is the number of 1 bits that are present in the constant multiplier. Thus, to multiply by
the constant value 42, the instruction

        mul     $15,$14,42

is converted to the sequence shown in figure 3-7. The number 42 (or 2#101010) has three 1 bits,
and so the number of instructions is 3 shifts and 2 adds.

        0x0:    000e7880    sll     t7,t6,2
        0x4:    01ee7821    addu    t7,t7,t6
        0x8:    000f7880    sll     t7,t7,2
        0xc:    01ee7821    addu    t7,t7,t6
        0x10:   000f7840    sll     t7,t7,1

Figure 3-7: Mips M/500 Code for Multiplication by 42

Where there are runs of 1's with no intervening 0's, the number of instructions is reduced.
Multiplying by 79, for example, produces only 2 shifts and 2 adds (one of the adds is a subtraction), even
though 79 (or 2#1001111) contains 5 bits that are 1. The Mips M/500 code is shown in figure 3-8.


⁶ Altering a single instruction may subtly change the actions of the assembler reorganizer, and critical sections of code must
be examined with great care following any modification.

⁷ Shifts and adds can always be used for multiplication. The problem is that there is a large chance that the time it takes to
execute these instructions is greater than the multiply instruction. When both the multiplier and the multiplicand are constant
values, the compiler precalculates the value instead of generating runtime code to perform the function.


        0x0:    000e7880    sll     t7,t6,2
        0x4:    01ee7821    addu    t7,t7,t6
        0x8:    000f7900    sll     t7,t7,4
        0xc:    01ee7823    subu    t7,t7,t6

Figure 3-8: Mips M/500 Code for Multiplication by 79

Figure  3-9  shows  a  worst  case  expansion  -  a  multiplication  by  2730  (or  2#101010101010),  which 
contains  6  discontiguous  1  bits.  In  this  example,  n  =  6,  and  a  single  multiply  is  expanded  to  6  shifts 
and  5  adds. 


        0x0:    000e7880    sll     t7,t6,2
        0x4:    01ee7821    addu    t7,t7,t6
        0x8:    000f7880    sll     t7,t7,2
        0xc:    01ee7821    addu    t7,t7,t6
        0x10:   000f7880    sll     t7,t7,2
        0x14:   01ee7821    addu    t7,t7,t6
        0x18:   000f7880    sll     t7,t7,2
        0x1c:   01ee7821    addu    t7,t7,t6
        0x20:   000f7880    sll     t7,t7,2
        0x24:   01ee7821    addu    t7,t7,t6
        0x28:   000f7840    sll     t7,t7,1

Figure 3-9: Mips M/500 Code for Multiplication by 2730
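
The worst-case rule of n shifts and n-1 adds can be reproduced with a short sketch. The Python below is not the Mips assembler's algorithm; `shift_add_plan` and `apply_plan` are hypothetical names, results are truncated to 32 bits as an assumption, and the sketch deliberately omits the run-of-ones subtraction trick used for 79 in figure 3-8. It walks the constant from its most significant bit downward, emitting a shift and an add for every further 1 bit and a final shift for any trailing zeros.

```python
def shift_add_plan(constant):
    """Return ('sll', k) and ('addu',) steps that multiply by constant > 0."""
    if constant <= 0:
        raise ValueError("constant must be positive")
    bits = bin(constant)[2:]          # most significant bit first
    plan = []
    pending = 0                       # shift distance accumulated so far
    for bit in bits[1:]:              # the leading 1 is the operand itself
        pending += 1
        if bit == "1":
            plan.append(("sll", pending))
            plan.append(("addu",))
            pending = 0
    if pending:
        plan.append(("sll", pending))  # trailing zeros of the constant
    return plan


def apply_plan(plan, x):
    """Execute a plan on x with 32-bit wraparound, like addu and sll."""
    acc = x
    for step in plan:
        if step[0] == "sll":
            acc = (acc << step[1]) & 0xFFFFFFFF
        else:
            acc = (acc + x) & 0xFFFFFFFF
    return acc


plan_42 = shift_add_plan(42)
plan_2730 = shift_add_plan(2730)
shifts_2730 = sum(1 for step in plan_2730 if step[0] == "sll")
adds_2730 = sum(1 for step in plan_2730 if step[0] == "addu")
```

Under these assumptions `plan_42` has the same shape as figure 3-7 (shift by 2, add, shift by 2, add, shift by 1), and the plan for 2730 contains 6 shifts and 5 adds, as in figure 3-9.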

For many users of the Mips M/500, however, this scheme presents an interesting set of problems.

1. Obviously, when minimizing code size is a paramount consideration, multiplications
can cause image code size to grow. Since the speed/space tradeoff of the
assembler/reorganizer is weighted on speed, multiply instructions are allowed to grow
to 14 times their original size.

2. Algorithms written in different languages may run at vastly different speeds.
Languages may implement constant values in different ways; thus, multiplications may be
implemented in different ways. Multiplying by the constant value 2 takes substantially
less time than multiplication by a variable containing the value 2.

3. A good compiler may actually generate code that runs slower than a bad compiler. A
compiler that compresses arithmetic expressions to eliminate spurious multiplies may
create code that expands into a larger sequence of shifts and adds than a compiler
that does not do compression (see section 7.7).

4. Altering a constant value (e.g., a C #define constant) may change the size and
speed of a program, even though the variable does not affect the number of iterations
in loops.
Users of the Mips assembler/reorganizer must therefore be very careful when generating
space-critical or speed-critical code. Figure 3-10 shows the time required to perform a multiply by a
constant value and by a variable using the mul instruction (the timing will be different for the mulo
instruction).


[Plot not reproduced; horizontal axis: Multiplier Value]

Figure 3-10: Relative Multiplication Speeds

Notice the widely varying times required to perform the multiplications. The two curves represent
constant values ranging from 0 to 100 for the solid curve, and from 2700 to 2800 for the stippled
curve. The solid line at the top of the graph is the time required to perform a multiply using the actual
mul instruction. Note that the time required to execute the shifts and adds never exceeds this time.

Mips  therefore  has  taken  pains  to  correctly  weight  the  mul  instruction  expansion  by  the 
assembler/reorganizer.  This  will  usually  result  in  faster  program  execution  (except  in  cases  similar  to 
that  shown  in  section  7.7),  although  predicting  actual  execution  time  can  be  difficult.  When  this  is  of 
critical  importance,  actual  instruction  counting  must  be  done. 

3.2.2.  Retargeting  of  Branch  Instructions 

One interesting effect of the assembler/reorganizer is that it will occasionally re-target a branch
instruction. This re-targeting will occur when at least the following three conditions are true:

1. The delay slot that must follow the branch cannot be filled with an instruction from
immediately before the branch.

2. The original target of the branch is not relatively relocatable - that is, the target must
be within the same module. Jump instructions that refer to addresses outside of the
local scope are ineligible.

3. The targeted instruction must not cause an exception.

When these conditions are met, the assembler/reorganizer will fill the delay slot following the branch
instruction with the instruction that was originally targeted by the branch, and will move the target of
the branch to the next instruction following the original branch target. Consider the example source
code shown in figure 3-11.

                .ent    foo 2
        foo:
                bge     $2, 0, $41
                negu    $2, $2
        $41:
                subu    $24, $18, $17
                beq     $24, $2, $43
                negu    $3, $3
        $43:
                jal     foo
                .end    foo

Figure 3-11: Example of Branch Target Relocation - Assembler Source

In this example, the negation of register $2 is only performed if $2 is less than 0. Otherwise, a
branch is executed to label $41, which subtracts registers $17 and $18. This is then followed by
another conditional branch and another negation. One might expect that this code fragment would
yield two nop instructions, one following each of the branch instructions.⁸ However, as can be seen
in figure 3-12, this is not the case.


        0x0:    foo:    04410003    bgez    v0,0x10
        0x4:            0251C023    subu    t8,s2,s1
        0x8:            00021023    subu    v0,zero,v0
        0xc:            0251C023    subu    t8,s2,s1
        0x10:           13020002    beq     t8,v0,0x1c
        0x14:           00000000    nop
        0x18:           00031823    subu    v1,zero,v1
        0x1c:           0c000000    jal     0
        0x20:           00000000    nop

Figure 3-12: Example of Branch Target Relocation - Mips M/500 Output


The anticipated second nop instruction is present at address 0x14, but the first delay slot has been
filled with the original target of the branch instruction, and the branch target has been moved from
0xc to 0x10. The reader is encouraged to trace the control flow of this fragment (remembering the
rules of reorganization around branch instructions) to verify that the output of the
assembler/reorganizer is correct. Notice that location 0x4 contains the same instruction as location
0xc (the original target). However, only one of these instructions is ever executed.
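
The trace suggested above can also be carried out mechanically. The following Python sketch is a toy model written for this purpose, not a Mips tool: the tuple encoding of instructions, the `run` and `reference` functions, and the use of unbounded Python integers in place of 32-bit registers are all assumptions. It executes the instructions of figure 3-12 up to (but not including) the jal at 0x1c, applying the rule that the instruction after a branch is always executed, and compares the final registers with a direct reading of the source in figure 3-11.

```python
# Instructions of figure 3-12, keyed by address; execution stops at 0x1c.
PROGRAM = {
    0x00: ("bgez", "v0", 0x10),
    0x04: ("subu", "t8", "s2", "s1"),
    0x08: ("subu", "v0", "zero", "v0"),
    0x0C: ("subu", "t8", "s2", "s1"),
    0x10: ("beq", "t8", "v0", 0x1C),
    0x14: ("nop",),
    0x18: ("subu", "v1", "zero", "v1"),
}
END = 0x1C


def run(regs):
    """Execute PROGRAM with one branch delay slot; return final registers."""
    r = dict(regs, zero=0)
    pc = 0
    pending = None                    # target of a branch awaiting its delay slot
    while pc != END:
        op = PROGRAM[pc]
        next_pc = pc + 4 if pending is None else pending
        pending = None
        if op[0] == "subu":
            r[op[1]] = r[op[2]] - r[op[3]]
        elif op[0] == "bgez" and r[op[1]] >= 0:
            pending = op[2]           # takes effect after the next instruction
        elif op[0] == "beq" and r[op[1]] == r[op[2]]:
            pending = op[3]
        pc = next_pc
    del r["zero"]
    return r


def reference(regs):
    """Direct reading of the assembler source in figure 3-11."""
    r = dict(regs)
    if r["v0"] < 0:
        r["v0"] = -r["v0"]            # negu $2,$2
    r["t8"] = r["s2"] - r["s1"]       # subu $24,$18,$17
    if r["t8"] != r["v0"]:
        r["v1"] = -r["v1"]            # negu $3,$3
    return r


cases = [
    {"v0": -5, "v1": 3, "s1": 2, "s2": 7, "t8": 0},
    {"v0": 4, "v1": 3, "s1": 2, "s2": 7, "t8": 0},
    {"v0": 0, "v1": 1, "s1": 9, "s2": 9, "t8": 0},
]
agree = all(run(c) == reference(c) for c in cases)
```

In this model `agree` is true for the sample register values, covering both the taken and the fall-through path of each branch.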


⁸ The first would be seen because the negate follows the branch in the original instruction stream. The second would be
present because the branch is contingent on the result of the subtraction, so the subtract cannot be moved after the branch.


Notice that this reorganization technique does not reduce the size of the program at all (nor does it
increase it). It does, however, speed up program execution by substituting nop instructions with
other "real" instructions; this is especially effective when the delay slot following a branch back to the
top of a loop can be filled with the first instruction of the loop (ρ-motion). The assembler/reorganizer
could also decrease program size with this technique. All of the "come from" points of an instruction
are known to the assembler.⁹ If an instruction has no "come from" points (that is, no instruction will
"fall through" to the instruction, and all branches have been retargeted), then that instruction may be
removed. In this case, the instruction at address 0xc will never be executed, and thus it could be
elided by the assembler/reorganizer.
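
The "come from" bookkeeping described above can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the Mips assembler's implementation: the instruction encoding, the function names, and the four-instruction example program are hypothetical, delay slots are ignored, and only a single pass is made (removing one instruction could orphan another).

```python
def come_from_points(program):
    """Map each instruction index to the indices that can reach it.

    program is a list of (mnemonic, target) pairs: 'j' is an unconditional
    jump, any other mnemonic with a target is a conditional branch, and a
    target of None means the instruction only falls through.
    """
    sources = {index: set() for index in range(len(program))}
    for index, (op, target) in enumerate(program):
        if target is not None:
            sources[target].add(index)          # "go to" edge
        if op != "j" and index + 1 < len(program):
            sources[index + 1].add(index)       # "fall through" edge
    return sources


def removable(program, entry=0):
    """Indices with no come-from points, other than the entry point."""
    sources = come_from_points(program)
    return [i for i in sorted(sources) if i != entry and not sources[i]]


# Hypothetical program: instruction 2 follows an unconditional jump and is
# not the target of any branch, so nothing can reach it.
EXAMPLE = [
    ("op", None),
    ("j", 3),
    ("op", None),
    ("op", None),
]
orphans = removable(EXAMPLE)
```

For the hypothetical program above, `orphans` contains only index 2, the one instruction that no branch targets and that nothing falls through to.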


3.3.  Local  Conclusions 

The  Mips  user-level  instruction  set  and  the  Mips  M/500  native  instruction  set  are  inherently  similar, 
though  radically  different  in  some  cases.  Evaluating  a  compiler  on  the  basis  of  the  Mips  assembly 
code  that  it  produces  would  therefore  be  a  mistake.  It  is  necessary  to  examine  the  reorganized 
native  machine  code  produced  by  the  assembler  reorganizer.  This  has  the  disadvantage  of  present¬ 
ing  to  the  reader  a  somewhat  confusing  picture,  because  some  instructions  (such  as  branches  and 
jumps)  do  not  take  effect  immediately  upon  being  scanned. 

Also, although most instructions take a single cycle to execute, some instructions (notably the mul
and div instructions, and co-processor instructions) take more than a single cycle. Evaluating the
predicted worst case runtime of a section of code can therefore be tricky, even without considering
the effects of the instruction cache (as discussed in section 5.3). Measurement is the only reliable
guide, and even measurements need careful interpretation.

It also appears that, wherever the assembler writers thought it appropriate, special case code has
been introduced to handle operands of zero. Since the assembler reorganizer is taking the liberty of
effectively rewriting the assembly program into a functionally equivalent, though structurally different
form, it is perfectly acceptable to interpret the constant value 0 and the zero register as identical.
Unfortunately, it is all too often the case that the two are not treated equivalently. This, combined
with the absence of many other special case tests (such as checking for an addend or dividend of 0),
suggests a non-uniform approach to the assembler reorganizer. It seems that the assembler writers
have considered each special test in line, rather than developing a rigorous solution to all of the
special conditions.¹⁰ The code that is generated by the assembler reorganizer is correct, although it
is sometimes suboptimal.¹¹


⁹ A "come from" point is either a branch instruction that executes a "go to" an instruction, or a prior instruction that "falls
through" to that instruction.

¹⁰ It could be argued that a "good" compiler would never generate code that uses many of these special cases (i.e.,
generating code that has a divisor or dividend of zero). It is usually on assumptions like these that catastrophes, and theses
on catastrophe theory, are based. We discovered a number of examples of this type of failure in the course of our
investigations.

¹¹ See, for example, the differences in code expansion for the seq instruction on page 176. For this instruction, a different
set of instructions are generated for a source of the zero register and for the constant value 0, even though the two are
identical values.


4. Analysis of Benchmarks

In general, benchmarks set out to do two things:

1. Produce some gross determination on the suitability of using given compiler generated
code for a given processor by providing some measure of its efficiency.

2. Determine the relative performance of various processors.

Regrettably, most published benchmarks fail to achieve these goals, and instead only report on a
given processor's ability to run a specific benchmark. The people who publish benchmark statistics
for a given machine are generally concentrating on the second factor only. By claiming that their
machine can execute "273 deka-Floppystones," they divulge almost no useful information. Yet the
notion of benchmarks as measures of performance is so pervasive that we felt compelled to present
some statistics, in spite of our feelings about their inapplicability.

The "art" of benchmarking is still in the stone age - the Whetstone and Dhrystone benchmarks were
written with a specified mix of instructions in mind (as well as a specific compiler technology), and
they test only that instruction mix. The Dhrystone benchmark even requires that certain optimizations
not be used when compiling the benchmark to most effectively test the features for which it was
designed.

A  benchmark  really  tests  two  things: 

1 .  A  compiler’s  effectiveness  in  generating  machine  code  from  source  language. 

2.  The  hardware’s  speed  in  executing  that  code. 

These  two  parts  are  inseparable  halves  of  the  whole  -  one  may  not  eliminate  either  part,  but  must 
examine  both  the  generated  machine  code  and  the  speed  at  which  it  is  executed.  In  restricting  the 
level  of  optimization  that  may  be  used,  the  Dhrystone  benchmark  considers  only  one  half  of  the 
compiler/machine couplet. If a given compiler has features which enable it to process source
language in an efficient way, those features should be tested in the benchmark since they will also be
used in real life. On the Mips M/500, these features include:

•  cross-module optimization

•  interprocedure register allocation

•  routine inlining (hoisting).¹²

We believe that these are valuable compiler functions and therefore have gathered all of our
benchmark statistics with these features enabled.


¹² Routine inlining is the process of removing a routine call and substituting it with the body of the routine. This action is
also called routine hoisting and increases the speed of a program by removing the overhead of parameter passing and routine
calling. When a routine is called from only one place in a program, routine inlining almost always results in a performance
improvement. However, as the number of call sites for a routine increases, the performance improvement begins to be offset
by an increased program image size. The decision to inline a routine is usually based on the number of call sites, the size of
the routine body versus the size of the call and return sequence, and on various specifics of register allocation.




We present in this chapter the results and analyses of four benchmarks.

1. Ackermann's Function [Wichmann 76] - this deceptively simple function is used to
examine the behavior of the compiler on a well-known fundamental problem.
Ackermann's Function is a highly recursive function that serves no "useful" purpose in
that it does not calculate anything of importance. However, the way in which a compiler
generates code for this function can be fairly easily reduced to a pair of meaningful
numbers. We evaluate these numbers and comment on their significance.

2. Whetstones [Curnow 76] - one of the numbers that hardware manufacturers like to
publicize to show off their computer's efficiency. In our opinion, all that this benchmark
measures is how efficiently a compiler/computer pair can execute the Whetstone
benchmark (and not how fast they can execute a real floating-point program). However,
since it is customary to measure this aspect of a computer's performance, we
provide (again, with a careful analysis) the results of the Mips M/500's performance in
this benchmark.

3. Dhrystones [Weicker 84] - another of the numbers that is produced to tout a
computer's performance. The Dhrystone measure concentrates on integer operations
of a mix calculated to simulate average integer programs. Unfortunately, it presents an
artificial picture of routine loading and parameter passing.

4. 20 Queens - a small integer-based program that calculates a mutually non-threatening
placement of twenty queens on a 20 x 20 chessboard. This benchmark was chosen
because it, too, was small enough to analyze in detail. The relative run times at the
various levels of optimization are presented to give a feel for optimizer efficiency on
this small scale.

We  examined  numerous  other  benchmarks.  Some  of  the  standard  ones  that  we  rejected  are: 

•  The CMU MCF benchmark suite [Barbacci 78] - these benchmarks are designed to test
the efficiency of numerous military processors by having humans write the most efficient
assembly code they could to perform a number of functions, including:

   •  character string search

   •  integer array manipulation

   •  linked list insertion

   •  character to floating-point conversion

   •  record packing and unpacking

These benchmarks were never executed in the original tests, but they measured the
applicability of different instruction sets to these tasks. The results of the evaluation
consisted of measuring the memory and register usage based on a high-level simulation
of the machines on real hardware and not execution speed. These benchmarks are
much too small to consider alone.

•  Quicksort - While this is a reasonable function to test for, the Quicksort algorithm is so
small that it does not really test the efficiency of the compiler. Also, it is somewhat data
dependent, so that even a machine-independent set of data does not really test the
algorithm.

•  FFT - Rejected for the same reason as Quicksort.

•  BYTE benchmarks - Rejected for the same reason as Quicksort.

•  EUUG benchmarks - These benchmarks showed that the Mips M/500 is useful as a
workstation, but the statistics that they produce are of little significance.

In all cases, it should be remembered that benchmarks are useless unless a detailed analysis of the
reasons for their performance is conducted. Simply presenting a set of unrelated numbers tells
nothing about a machine. A benchmark's behavior on a given machine is also highly correlated with
the efficiency of the compiler that is generating code for it, and ignoring the compiler's effect ignores
the truth.

Readers are also cautioned to first read section 5.3 before attempting to generate or execute
benchmarks on their own. It is insufficient to run a benchmark once or twice to determine its
execution speed. The graphs shown in figures 5-1 and 5-2 are the distillation of data acquired from
running 768 different programs a total of 4608 times. The graphs shown in figures 5-3 and 5-4 are
pictures of the data collected from 520 individual programs executed over 6000 times. The primary
reason for this huge collection of data was to eliminate any special factors which could influence the
run time of the test programs. Many factors influence the execution speed of a benchmark; simply
asking all other users to log off is insufficient. As with the results of a benchmark, the ancillary
influencing factors must also be analyzed before any meaningful results can be extracted from the
mass of data.


4.1.  Ackermann’s  Function 

Ackermann's  Function  is  a  reasonable  measure  of  the  efficiency  of  a  compiler’s  treatment  of  routine 
calls  and  the  associated  integer  arithmetic.  It  is  a  useful  benchmark  in  that  it  can  be  used  to  simply 
quantify  (without  running  the  program)  the  performance  of  a  compiler. 

4.1.1.  Method  of  Analysis 

The  first  number  that  can  be  derived  from  the  output  of  a  compiler  is  the  size  (in  bytes)  of  the 
generated  code.  This  number  gives  a  reasonable  handle  on  the  overall  efficiency  of  a  compiler, 
particularly  compared  to  other  compilers  for  machines  with  similar  instruction  set  complexity. 

The second number is a fair indicator of the speed of the generated code. This number is the
average of the number of instructions needed to execute either the first or third leg of the conditional
expression comprising Ackermann's function (see figure 4-1 for a statement of the function). The
average of the first and third legs of the conditional is used because these comprise the predominant
run-time load of the function.¹³

Ideally, the lower the numbers for both measures, the better the compiler. This generalization,
however, can be misleading. For example, the VAX calls instruction is very expensive, yet clever
usage of it can reduce the second measure considerably, at very little improvement in run-time
performance. We will attempt to objectively evaluate the performance of the Mips C and Pascal
compilers and contrast them to comparable compilers on comparable architectures.¹⁴


¹³ The first and third legs are executed nearly the same number of times, which are disproportionately frequent compared to
the second leg. For acker(3,8), the first leg is executed 1,391,982 times and the third leg is executed 1,391,981 times, while
the second leg is executed only 2,036 times. Since the second leg accounts for only 0.073% of the total function load, it may
be ignored.


acker(n,m)
{
    if (n == 0)
        return m+1;
    else if (m == 0)
        return acker(n-1,1);
    else
        return acker(n-1,acker(n,m-1));
}

Figure 4-1: C Source Code for Ackermann's Function


function acker(n,m : integer) : integer;
begin
    if n = 0 then
        acker := m+1
    else if m = 0 then
        acker := acker(n-1,1)
    else
        acker := acker(n-1,acker(n,m-1));
end;

Figure 4-2: Pascal Source Code for Ackermann's Function


4.1.2.  Analysis  of  C  and  Pascal 

The C and Pascal source code for Ackermann's Function is shown in figures 4-1 and 4-2,
respectively. The assembly language output of the C compiler¹⁵ (seen in figure 4-3) shows a total byte
count of 92 (23 instructions of 4 bytes each), with a mean instruction count of 14.¹⁶ To show how
this latter number is arrived at, we have added the tag "[1]" for instructions that are executed when
the first leg of the conditional is executed, and the tag "[3]" for those that are executed when the
third leg is executed.


The  numbers  for  C  compare  quite  favorably  with  the  other  architectures  and  compilers  evaluated  by 
Wichmann.  Given  that  the  architecture  of  the  Mips  M/500  is  RISC  in  nature,  these  values  are  very 
good  (in  fact,  they  are  quite  respectable  for  CISC  architectures,  too).  However,  these  accolades 
must  be  held  in  abeyance  for  a  little  while.  As  discussed  in  section  3.2,  the  instructions  that  are 
emitted  by  the  code  generator  are  not  necessarily  the  instructions  that  are  executed  by  the  Mips 
M/500.  Since  the  Mips  M/500  native  instruction  set  is  not  identical  to  the  Mips  assembly  language, 
we  must  use  the  disassembler  to  look  at  the  actual  machine  language  image  before  we  can  come  up 
with  accurate  values  for  the  Ackermann  Function  analysis. 


Figure 4-4 shows the actual Mips M/500 native machine code that is executed for Ackermann's
Function when compiled with the C compiler at optimization level 2.


¹⁴ Brian Wichmann has accumulated many measurements of the code generated for Ackermann's function in [Wichmann
82]. References to other compilers are from that report.

¹⁵ This example was compiled with the -O switch, which selects optimization level 2. There is no extra benefit for
optimization levels 3 or 4 for a simple function like this.

¹⁶ There are 11 instructions in the first leg, 17 in the third, with an average of (11+17)/2=14.


 #   1  acker(n,m)
 #   2  {
acker:
        subu    $sp, 24               [1] [3]
        sw      $31, 20($sp)          [1] [3]
        sw      $16, 16($sp)          [1] [3]
        move    $16, $4               [1] [3]
        move    $3, $5                [1] [3]
 #   3  if (n == 0)
        bne     $16, 0, $32           [1] [3]
 #   4  return m+1;
        addu    $2, $3, 1             [1]
        b       $34                   [1]
 #   5  else if (m == 0)
$32:
        bne     $3, 0, $33                [3]
 #   6  return acker(n-1,1);
        addu    $4, $16, -1
        li      $5, 1
        jal     acker
        b       $34
 #   7  else
 #   8  return acker(n-1,acker(n,m-1));
$33:
        move    $4, $16                   [3]
        addu    $5, $3, -1                [3]
        jal     acker                     [3]
        addu    $4, $16, -1               [3]
        move    $5, $2                    [3]
        jal     acker                     [3]
$34:
        lw      $16, 16($sp)          [1] [3]
        lw      $31, 20($sp)          [1] [3]
        addu    $sp, 24               [1] [3]
        j       $31                   [1] [3]

Figure 4-3: Assembly Output from the C Compiler

In this case, we come up with a total byte count of 96 (24 instructions of 4 bytes each), with a mean
instruction count of 14.5.¹⁷ To show how this latter number is arrived at, we have again added the
tag "[1]" for instructions that are executed when the first leg of the conditional is executed, and the
tag "[3]" for those that are executed when the third leg is executed.¹⁸ The counts have increased
somewhat, although not markedly. Still, because the instructions that are actually executed by the
Mips M/500 are different from those emitted by the code generator, one must be careful when
evaluating the expected run-time of any program. In this case, the time increase is a little more than
3.5%, but there are cases in which a single Mips assembly language instruction will be expanded to
12 or more times its original size when converted to Mips M/500 native instructions. (See the table of
instruction conversions starting on page 146 for more details on this feature of the assembler
reorganizer.)

The  values  shown  in  tables  4-1  and  4-2  show  the  code  size  and  average  number  of  instructions 
executed  for  the  C  and  Pascal  versions  of  Ackermann's  Function  at  varying  levels  of  optimization. 

¹⁷ There are 12 instructions in the first leg, 17 in the third, with an average of (12+17)/2=14.5.

¹⁸ See section 3.1 for an explanation as to why the instructions after the branches are still considered in the instruction
counts.


acker:
0x0:    27bdffe8    addiu   sp,sp,-24         [1] [3]
0x4:    afbf0014    sw      ra,20(sp)         [1] [3]
0x8:    afb00010    sw      s0,16(sp)         [1] [3]
0xc:    00808021    move    s0,a0             [1] [3]
0x10:   16000003    bne     s0,zero,0x20      [1] [3]
0x14:   00a01821    move    v1,a1             [1] [3]
0x18:   1000000e    b       0x54              [1]
0x1c:   24620001    addiu   v0,v1,1           [1]
0x20:   14600007    bne     v1,zero,0x40          [3]
0x24:   02002021    move    a0,s0                 [3]
0x28:   2604ffff    addiu   a0,s0,-1
0x2c:   0c000000    jal     acker
0x30:   24050001    li      a1,1
0x34:   10000008    b       0x58
0x38:   8fbf0014    lw      ra,20(sp)
0x3c:   02002021    move    a0,s0
0x40:   0c000000    jal     acker                 [3]
0x44:   2465ffff    addiu   a1,v1,-1              [3]
0x48:   2604ffff    addiu   a0,s0,-1              [3]
0x4c:   0c000000    jal     acker                 [3]
0x50:   00402821    move    a1,v0                 [3]
0x54:   8fbf0014    lw      ra,20(sp)         [1] [3]
0x58:   8fb00010    lw      s0,16(sp)         [1] [3]
0x5c:   03e00008    jr      ra                [1] [3]
0x60:   27bd0018    addiu   sp,sp,24          [1] [3]

Figure 4-4: Mips M/500 Native Machine Code for Ackermann's Function


Optimization Level       -O0     -O1     -O2     -O3     -O4

Byte Count               192     176     100     100     100

Instruction Average       27    18.5    14.5    14.5    14.5

Table 4-1: C Compiler Efficiency Measures Using Ackermann's Function


Optimization Level       -O0     -O1     -O2     -O3     -O4

Byte Count               200     152     108     108     108

Instruction Average       29      21      16      16      16

Table 4-2: Pascal Compiler Efficiency Measures Using Ackermann's Function

The marked improvement for both compilers for optimization level 2 over optimization level 0
demonstrates conclusively the positive effects of an optimizer for even simple programs like this. In fact,
even optimization level 1 causes a noticeable shrinkage in code size and execution count.¹⁹
Optimization levels 3 and 4 (which are fairly sophisticated) do not have any effect on programs that are
this simple.²⁰

acker:
0x0:    27bdffe0    addiu   sp,sp,-32
0x4:    afbf001c    sw      ra,28(sp)
0x8:    afb00014    sw      s0,20(sp)
0xc:    afb10018    sw      s1,24(sp)
0x10:   00808021    move    s0,a0
0x14:   00a03021    move    a2,a1
0x18:   16000003    bne     s0,zero,0x28
0x1c:   00408821    move    s1,v0
0x20:   10000012    b       0x6c
0x24:   24c30001    addiu   v1,a2,1
0x28:   14c00008    bne     a2,zero,0x4c
0x2c:   02002021    move    a0,s0
0x30:   2604ffff    addiu   a0,s0,-1
0x34:   24050001    li      a1,1
0x38:   0c000000    jal     acker
0x3c:   02201021    move    v0,s1
0x40:   1000000a    b       0x6c
0x44:   00401821    move    v1,v0
0x48:   02002021    move    a0,s0
0x4c:   24c5ffff    addiu   a1,a2,-1
0x50:   0c000000    jal     acker
0x54:   02201021    move    v0,s1
0x58:   2604ffff    addiu   a0,s0,-1
0x5c:   00402821    move    a1,v0
0x60:   0c000000    jal     acker
0x64:   02201021    move    v0,s1
0x68:   00401821    move    v1,v0
0x6c:   00601021    move    v0,v1
0x70:   8fb00014    lw      s0,20(sp)
0x74:   8fbf001c    lw      ra,28(sp)
0x78:   8fb10018    lw      s1,24(sp)
0x7c:   03e00008    jr      ra
0x80:   27bd0020    addiu   sp,sp,32

Figure 4-5: Mips M/500 Machine Language Output from Figure 4-2


¹⁹ Optimization level 1 is performed by default, unless explicitly switched off.

²⁰ One optimization that would have an effect on this program is tail recursion elimination. The Mips optimizer does not do
this particular type of optimization. When this and other hand optimizations are performed on this module (specifically,
improving the calling convention used by this routine and reducing the entry/exit protocol), the byte count can be reduced to
72, and the average instruction count to 9 (see section 4.1.4).

The values for C are noticeably better than those for Pascal,²¹ even though we would predict that at
level 4 optimization, there should be no difference between the two. The C code generator is
apparently able to take advantage of the fact that the values returned do not need to be assigned to
intermediate storage locations, while the Pascal compiler, being constrained to always store the
return value of a function, does not. The Pascal compiler could have achieved equally good results
by not making the mistake of using two registers (specifically v0 and v1) to hold return values and
intermediate results, and then needing to perform a register shuffle. This shortcoming can be
corrected by a better register tracking and assignment algorithm in the Pascal compiler.

4.1.3.  Analysis  of  BCPL 

The  same  analysis  was  performed  for  the  BCPL  version  of  Ackermann's  Function.  The  BCPL 
source  code  is  shown  in  figure  4-6. 

LET acker(m,n) =
    m=0 -> n+1,
    n=0 -> acker(m-1,1),
    acker(m-1,acker(m,n-1))

Figure 4-6: BCPL Source for Ackermann's Function

The  output  of  the  BCPL  compiler  is  shown  in  figure  4-7,  with  the  same  tagging  notation  used  in  the 
earlier  examples.  The  front  end  of  the  compiler  has  performed  code  hoisting  of  the  function  return 
sequence  so  that  each  of  the  three  arms  ends  with  a  direct  return  (instead  of  branching  to  the  return 
sequence  as  is  done  with  C  and  Pascal). 

A  total  of  27  instructions  is  needed  to  implement  the  function.  The  average  number  of  instructions 
per  call  is  (6+17)/2  or  11.5.  The  compiler  has  only  one  (fairly  low)  level  of  optimization.  Even  so, 
these numbers are fairly good. However, figure 4-7 shows the unreorganized, high-level Mips code.
The reorganizer changes it to what is shown in figure 4-8.

The  actual  Mips  M/500  native  code  uses  32  instructions,  or  128  bytes  of  code  to  implement  the 
routine.  The  mean  number  of  instructions  per  call  is  (8+20)/2  or  14.  Note  that  the  code  could  be 
improved  by  a-motion  of  the  code  at  0x28  and  0x48,  and  by  eliminating  the  nop  at  0x24. 


²¹ When the C compiler is coerced into recognizing common subexpressions by changing the code to read:

acker(n,m)
{
    if (n == 0)
        return m+1;
    else
        return acker(n-1, m == 0 ? 1 : acker(n,m-1));
}

the total code size measure improves even more (reducing it to 86 bytes of code). While this last optimization does nothing to
improve the execution speed of the routines (in fact, it hinders it somewhat, raising the average number of instructions
executed to 15), it does reduce the overall size of the routine. This optimization, carried out over larger programs, will have
the effect of reducing overall program size, and hence, the amount of paging the system needs to perform. Since run-time
may be increased, however, it is up to the program implementors to decide where the tradeoff is to be made. Ideally, both the
C and Pascal compilers should recognize this inherent commonality, and should produce somewhat smaller code with no
increase in execution time. The results shown here, however, are still very favorable.




#define u0 $2
#define u1 $3
#define rz $0
#define ru $16                        # holds 1
#define rp $22                        # Ocode stack pointer
#define rl $31

LA1:
        sw      rl, 0(rp)             [1] [3]
        sw      u0, 8(rp)             [1] [3]
        bne     u0, rz, LA3           [1] [3]
        add     u0, u1, ru            [1]
        lw      rl, 0(rp)             [1]
        j       rl                    [1]
LA3:
        bne     u1, rz, LA5               [3]
        sub     u0, u0, ru
        move    u1, ru
        add     rp, 16
        bal     LA1
        lw      rl, (0-16)(rp)
        sub     rp, 16
        j       rl
LA5:
        sub     u0, u0, ru                [3]
        sw      u0, 24(rp)                [3]
        sub     u1, u1, ru                [3]
        lw      u0, 8(rp)                 [3]
        add     rp, 28                    [3]
        bal     LA1                       [3]
        move    u1, u0                    [3]
        lw      u0, (24-28)(rp)           [3]
        sub     rp, 12                    [3]
        bal     LA1                       [3]
        lw      rl, (0-16)(rp)            [3]
        sub     rp, 16                    [3]
        j       rl                        [3]

Figure 4-7: Assembly Language Output from BCPL Compiler

4.1.4.  Results  of  Hand  Coding  In  Mips  Assembly  Language 

The  hand  coded  version,  which  is  the  best  we  can  currently  come  up  with,  is  seen  in  figure  4-9.  This 
is  reorganized  into  what  we  see  in  figure  4-10. 

The hand optimized code results in 18 instructions (or 72 bytes), and the mean number of
instructions per call is (4+14)/2 or 9. This is a substantial improvement over the Mips C compiler and the
BCPL compiler, and reflects the power that a compiler could achieve with the proper optimizations.
The special optimizations that were done are:

•  Tail  recursion  elimination 

•  Procedure  call  protocol  elimination 

•  Writing  the  Mips  assembly  language  to  eliminate  all  possible  nop  instructions.  It  would 
have  been  easier  to  write  directly  in  the  Mips  M/500  native  instruction  set,  but  this  option 
was  not  available  to  us. 


0x0:    aedf0000    sw      ra,0(s6)          [1] [3]
0x4:    aec20008    sw      v0,8(s6)          [1] [3]
0x8:    14400005    bne     v0,zero,0x20      [1] [3]
0xc:    aec3000c    sw      v1,12(s6)         [1] [3]
0x10:   8edf0000    lw      ra,0(s6)          [1]
0x14:   00701020    add     v0,v1,s0          [1]
0x18:   03e00008    jr      ra                [1]
0x1c:   00000000    nop                       [1]
0x20:   14600009    bne     v1,zero,0x48          [3]
0x24:   00000000    nop                           [3]
0x28:   00501022    sub     v0,v0,s0
0x2c:   02001821    move    v1,s0
0x30:   0411fff3    bgezal  zero,0x0
0x34:   22d60010    addi    s6,s6,16
0x38:   8edffff0    lw      ra,-16(s6)
0x3c:   22d6fff0    addi    s6,s6,-16
0x40:   03e00008    jr      ra
0x44:   00000000    nop
0x48:   00501022    sub     v0,v0,s0              [3]
0x4c:   aec20018    sw      v0,24(s6)             [3]
0x50:   00701822    sub     v1,v1,s0              [3]
0x54:   8ec20008    lw      v0,8(s6)              [3]
0x58:   0411ffe9    bgezal  zero,0x0              [3]
0x5c:   22d6001c    addi    s6,s6,28              [3]
0x60:   00401821    move    v1,v0                 [3]
0x64:   8ec2fffc    lw      v0,-4(s6)             [3]
0x68:   0411ffe5    bgezal  zero,0x0              [3]
0x6c:   22d6fff4    addi    s6,s6,-12             [3]
0x70:   8edffff0    lw      ra,-16(s6)            [3]
0x74:   22d6fff0    addi    s6,s6,-16             [3]
0x78:   03e00008    jr      ra                    [3]
0x7c:   00000000    nop                           [3]

Figure 4-8: Machine Language Output from Figure 4-7


4.1.5.  Comparison 

Table 4-3 gives the results for each language analyzed. It shows the total size of the function in
bytes, the average number of instructions per call, and the time to execute acker(3,8). In all
cases, the figures are for the reorganized code (i.e., the native Mips M/500 code), not the high level
assembly language.

4.1.6.  Local  Conclusions 

All things considered, the measures obtained by analyzing the compilers' treatment of Ackermann's
Function are quite favorable, especially for a RISC-style architecture. Clearly, the Mips compilers
and hardware are on to something. However, the hand optimizations shown in section 4.1.4
indicate that there is still a large degree of improvement that can be obtained. RISC architectures,
especially when pipelined, can be tricky machines to generate code for. Mips has done a very
reasonable first pass at the development of a set of good compilers (especially for a system that has
been developed ex nihilo) and has demonstrated that a RISC architecture is a good choice. While
the measure of Ackermann's Function is only one indication of the efficiency of a compiler and
hardware combination, the results shown here are very promising. Nonetheless, Mips needs to
apply additional effort in their compiler team.




LA1:
        sw      $31,0($22)            # no need to store parameters yet
        bne     $2,$0,LA3
        add     $2,$3,$16
        j       $31                   # Return
LA3:
        sub     $2,$2,$16             # moved up to fill branch slot
        bne     $3,$0,LA4
        move    $3,$16
        b       LA1                   # Call (tail recursion elimination)
LA4:
        sw      $2,8($22)             # save across inner call
        sub     $3,$3,$16
        add     $2,$2,$16
        add     $22,12                # true call follows - must move stack
        bal     LA1
        move    $3,$2
        lw      $2,(8-12)($22)
        lw      $31,(0-12)($22)
        sub     $22,12                # should fill branch slot
        b       LA1                   # tail recursion elimination

Figure 4-9: Hand Optimized Version of Ackermann's Function


0x0:    14400003    bne     v0,zero,0x10      [1] [3]
0x4:    aedf0000    sw      ra,0(s6)          [1] [3]
0x8:    03e00008    jr      ra                [1]
0xc:    00701020    add     v0,v1,s0          [1]
0x10:   14600003    bne     v1,zero,0x20          [3]
0x14:   00501022    sub     v0,v0,s0              [3]
0x18:   1000fff9    b       0x0
0x1c:   02001821    move    v1,s0
0x20:   00701822    sub     v1,v1,s0              [3]
0x24:   aec20008    sw      v0,8(s6)              [3]
0x28:   00501020    add     v0,v0,s0              [3]
0x2c:   0411fff4    bgezal  zero,0x0              [3]
0x30:   22d6000c    addi    s6,s6,12              [3]
0x34:   00401821    move    v1,v0                 [3]
0x38:   8ec2fffc    lw      v0,-4(s6)             [3]
0x3c:   8edffff4    lw      ra,-12(s6)            [3]
0x40:   1000ffef    b       0x0                   [3]
0x44:   22d6fff4    addi    s6,s6,-12             [3]

Figure 4-10: Machine Language Output for Figure 4-9


Language     Function Size   Instructions   Execution Time of
             (bytes)         per Call       acker(3,8) (seconds)

C            100             14.5           5.6
Pascal       108             16             9.0
BCPL         128             14             6.1
Assembler    72              9              3.8

Table 4-3: Summary of Statistics For Ackermann's Function

4.2. Whetstone Benchmark

The Whetstone benchmark is an artificial benchmark,[22] contrived to measure the floating point
performance of a machine. Being an artificial benchmark, the results of this benchmark are of
questionable value in analyzing the true performance of the Mips M/500. Typically, the best test for
the performance of a machine is a real application program, such as the Spice benchmark. However,
since practically all machine comparisons include the results of the Whetstone benchmark, we felt it
would be appropriate to include it in our analysis. The actual source code of the benchmark is not
reproduced here.

4.2.1.  Method  of  Analysis 

The data generated by the Whetstone benchmark are usually interpreted as a straightforward
measure of the hardware's efficiency in performing floating point calculations. However, in truth
there is a much more subtle interaction with the source language and the compiler's optimizing
capabilities than most people would admit. One might assume that since the tests performed by this
benchmark are so simple, the choice of language and/or compiler might not make a difference. In
fact, however, we will show that there are large differences, even on one machine with languages
supplied by one manufacturer.

To  portray  the  Mips  M/500  as  accurately  as  possible,  the  Whetstone  benchmark  was  executed  in 
three  languages  (Fortran,  C,  and  Pascal),  with  both  single  and  double  precision  floating  point 
variables,  at  all  levels  of  optimization. 

4.2.2.  Results 

Table  4-4  shows  the  results  obtained  for  the  three  languages  at  the  highest  level  of  optimization 
(along  with  the  values  for  the  Vax  with  the  Berkeley  4.3  compiler  for  comparison  purposes).  The 
larger  the  value  for  the  Whetstone  benchmark,  the  greater  the  machine/compiler  performance. 

Clearly, the Mips M/500 is a fast processor with a fast floating point board.[23] It is unclear why the
values for the different languages differ so much. To more clearly visualize these differences,
examine figure 4-11. From the upward slope of the graph, it is clear that the extra optimization levels
cause the program to run faster.

It is unclear why each language's double precision floating point performance is roughly equal (the
solid markers on the graph), but the single precision performance is very different (the open markers
on the graph). One explanation is the non-zero origin effects seen in the graph - the curves do not
seem nearly so spread out on a graph whose origin is at zero. In the following sections, we discuss
other factors.

    Mips C (Single Precision)             1494.71
    Mips C (Double Precision)             1730.94
    Mips Fortran (Single Precision)       2408.67
    Mips Fortran (Double Precision)       1615.22
    Mips Pascal (Single Precision)        1912.76
    Mips Pascal (Double Precision)        1734.02
    Vax C (Single Precision)               386.45
    Vax C (Double Precision)               372.19
    Vax Fortran (Single Precision)         414.31
    Vax Fortran (Double Precision)         381.92

Table 4-4: Whetstone Numbers for Mips and Vax

Figure 4-11: Whetstone Benchmark Performance

[22] By "artificial" we mean that the Whetstone benchmark is a program that consists of a number of
small modules that test floating point behavior for a number of high level language features that
presumably map to low level hardware features on the machine(s) for which the benchmark was
originally contrived. Since the Mips M/500 architecture and compilers are potentially very different
from the original target architecture(s), the results derived from running the Whetstone benchmark
are questionable at best.

[23] The Mips M/500 runs from nearly 4 times faster than the MicroVax (for double precision C) to
over 6 times faster than the MicroVax (for double precision Fortran). It should also be noted that
these results were obtained with the Wytek floating point board supplied by Mips. This board is not
running at full speed, but rather is fully emulating the Mips M/500 floating point chip currently under
development. When the floating point chip is completed, Mips predicts a substantial performance
improvement.
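
The relative standing of the two machines can be read off Table 4-4 by dividing each Mips figure by
the Vax figure for the same language and precision. The sketch below does only that arithmetic; the
structure and function names are hypothetical, and the numbers are copied from the table (Pascal is
omitted because the table gives no Vax figure for it).

```c
/* One language/precision pairing from Table 4-4 (names are illustrative). */
struct whetstone_row {
    const char *label;
    double mips;    /* Whetstone number on the Mips M/500 */
    double vax;     /* Whetstone number on the Vax */
};

static const struct whetstone_row table_4_4[] = {
    { "C (Single Precision)",       1494.71, 386.45 },
    { "C (Double Precision)",       1730.94, 372.19 },
    { "Fortran (Single Precision)", 2408.67, 414.31 },
    { "Fortran (Double Precision)", 1615.22, 381.92 },
};

/* Ratio of the Mips figure to the Vax figure for one row. */
static double speedup(const struct whetstone_row *row)
{
    return row->mips / row->vax;
}
```

For the table as printed, the ratios fall between roughly 3.9 (single precision C) and 5.8 (single
precision Fortran).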

4.2.2.1. Analysis of C Results

For the C version, the reasons for the differences are easy to explain - the semantics of C require
that all single precision variables be promoted to double precision before calculations are performed,
and then demoted back to single precision after the calculations are completed. This strategy made
sense in the original PDP-11 implementation of C, where double precision variables took twice as
much space to store as single precision variables, and the extra expense of conversion was worth
paying in exchange for being able to store twice as many floating point variables in a limited (64K
bytes) address space. In modern computers with 32 bit address space and large virtual memories,
this strategy makes little sense. However, the semantics of the language are firmly established, and
this accounts for the poorer performance of the single versus double precision Whetstone numbers
in C.

The Mips C compiler provides a compile-time switch to disable this conversion,[24] and the
performance of this compilation is reported as 'C Forced Single' in figure 4-11. Single precision still
performs less efficiently than does double precision - a wholly unpredicted result.

The  explanation  is  somewhat  complex.  Certainly,  the  compiler  does  do  some  forcing  of  operands  to 
float,  which  accounts  for  some  of  the  speedup  between  this  and  the  single  precision  version  that 
must  convert  each  operand  to  double  precision.  There  are,  however,  some  cases  where  it  cannot 
help  but  convert: 

1. Since there is only one version of the transcendental functions in the C library, and
since these routines return double precision results, the compiler must perform
conversions of the returned values back to single precision. The -float switch, therefore,
does not affect routine return values. Additionally, the run-time library is performing
double precision operations, even though the main body of code is only using single
precision.

2. Since C does not have a "forward" declaration that includes the types of the
parameters to a routine, C promotes all character and short parameters to type integer,
and more importantly, all single precision floating point parameters to double precision
(this is to maintain correct stack alignment on machines that pass parameters on the
stack). This is required behavior for the transcendental functions, and (as a general
statement) semantically required behavior for all subroutines. The -float switch,
therefore, does not affect parameter passing.

3.  Since  all  floating  point  parameters  are  passed  as  double  precision  values,  the  compiler 
performs  double  precision  operations  on  those  parameters  where  appropriate  (i.e., 
when  it  can  avoid  a  conversion).  When  it  cannot  avoid  a  conversion,  it  promotes  the 
single  precision  components  to  double  precision  (to  avoid  loss  of  accuracy). 
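
The first two points can be observed in any C implementation that follows the same promotion rules.
The following is a minimal sketch, not taken from the benchmark: sum_promoted and the two checking
functions are hypothetical names, and sinf is the single precision routine of the later C99 library,
which was not available to the compilers discussed here. Arguments passed without a parameter type
(here, the variable part of the argument list) are promoted to double, and the classic sin routine
yields a double even when handed a float.

```c
#include <math.h>
#include <stdarg.h>

/* Hypothetical helper: sums `count` floating point arguments.  Arguments in
   the variable part of a call undergo the default argument promotions, so a
   float passed here arrives as a double and must be fetched as one. */
static double sum_promoted(int count, ...)
{
    va_list ap;
    double total = 0.0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

/* sizeof does not evaluate its operand, so these only inspect result types. */
static int sin_returns_double(void)
{
    float x = 0.5f;
    return sizeof(sin(x)) == sizeof(double);    /* float argument, double result */
}

static int sinf_returns_float(void)
{
    float x = 0.5f;
    return sizeof(sinf(x)) == sizeof(float);    /* C99 single precision variant */
}
```

A call such as sum_promoted(2, 1.5f, 2.25f) therefore converts both arguments to double before the
routine ever sees them, whatever precision the surrounding code uses.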

Approximately 46% of the total tests in the benchmark perform double precision arithmetic
operations, convert parameters and return values, or call double precision transcendental functions
even when the -float compiler option is used (loops n7, n8, and n11 of the benchmark). Thus, even
though the C version of the Whetstone benchmark is able to give some performance improvement
when compiled with the -float option, it must of necessity incur some penalties by converting
values from single to double precision and vice versa. This is not the fault of the Mips C compiler,
but rather the fault of the language semantics (to which the Mips C compiler is faithfully adhering).
The only suggestion that we have to Mips is that they change their documentation to reflect the true
behavior of their compiler, instead of stating the intended behavior.


[24] This is the -float option, and is documented as causing the compiler to never promote floats to
doubles.

4.2.2.2. Analysis of Fortran Results

With the problem of the C compiler differences resolved, the question now remains, "Why is
Fortran single precision performance so good?", or, alternatively, "Why is Pascal single precision
performance so bad?" The answer to this dual question is that the Fortran compiler is good, and
that the Pascal compiler is bad when it comes to single versus double precision performance.

When we examined the assembler output of the Fortran compiler, we found that the double
precision and single precision versions were nearly identical. The only differences were the
substitution of double versus single precision opcodes, a doubling of data sizes for double precision,
and different run time library routines being called for transcendental functions. It is not surprising,
therefore, that Fortran should perform as well as it does for the single precision Whetstone
benchmark.[25] The Fortran compiler is not trying to do anything 'clever' with either single or double
precision numbers, and so does not rob itself of any performance.

4.2.2.3.  Analysis  of  Pascal  Results 

Pascal,  on  the  other  hand,  does  not  perform  as  well  as  Fortran,  nor  does  it  perform  as  poorly  as 
C.  The  reason  for  this  is  quite  simple.  The  Pascal  compiler  is  not  bound  by  the  same  conversion 
semantics  as  the  C  compiler,  so  Pascal  is  able  to  generate  much  more  efficient  single  precision 
operations. Additionally, since Pascal has a 'forward' declaration that defines the type of both the
function parameters and its return value, Pascal can intelligently assign registers and stack locations
for parameters, and not suffer by forcing all parameters to be double precision.

However, Pascal (as opposed to Fortran) has only one version of all of its transcendental functions,
and this version operates in double precision mode with double precision parameters. Pascal must
therefore convert single precision parameters to double precision, and convert the function result
back to single precision (and incur the expense of calculating the transcendental function in double
precision mode).[26]

When the Whetstone benchmark was altered to eliminate the transcendental functions (benchmark
loops n7 and n11), Pascal performed within 5% of Fortran.

4.2.3.  Local  Conclusions 

The evaluation of the floating point performance of the Mips M/500 (and indeed, of any machine) can
be strongly biased by the performance of the compilers. One cannot take statistics such as the
Whetstone benchmark too seriously until one has thoroughly analyzed the compiler that was used to
generate the tests. The choice of language and compiler can radically affect the test results.

It is with this in mind that we again maintain that bad software can always be the downfall of good
hardware. In some cases, the fault lies with the designers of the language, and, in others, with the
implementors of the compilers. Although the Mips M/500 shows great promise in the area of pure
performance (specifically, with the Fortran compiler), some work must be done to bring the
compiler technology for all languages up to a common level. For Pascal, this means modifying the
run-time libraries to include single precision transcendental functions. For C, this means paying
closer attention to when conversions are necessary, and when double precision operations are more
expensive than converting to single precision and performing single precision operations.

In general, all languages currently available on the Mips provide similar power for double precision.
For raw single precision floating point performance, however, it seems that for all its shortcomings,
Fortran is currently the language of choice.

[25] Fortran does not improve as noticeably as Pascal or C for level 4 optimization because Fortran
programs cannot be readily subjected to the routine hoisting (see page 15) that takes place with this
optimization level. Fortran semantics state that all local variables are statically allocated, so routine
hoisting becomes a difficult, if not nearly impossible, task.

[26] This was merely an oversight on the part of the implementors of Mips Pascal, and is not an
intrinsic feature of the language.


4.3.  Dhrystone  Benchmark 

The Dhrystone benchmark is another artificial benchmark, constructed to measure the integer
performance of a machine. Since it is an artificial benchmark, the results of this benchmark are of
questionable value in analyzing the true performance of the Mips M/500. As artificial benchmarks go,
however, the Dhrystone benchmark appears to be a fairly reasonable test. One argument against it
is that it has a fairly high percentage of routine calls, which unfairly biases the results against those
machines with an expensive procedure call interface. However, since practically all machine
comparisons include the results of the Dhrystone benchmark, we felt it would be appropriate to
include it in our analysis. The actual source code of the benchmark is not reproduced here.

4.3.1. Method of Analysis

The data generated by the Dhrystone benchmark are usually interpreted as a straightforward
measure of the hardware's efficiency in performing integer calculations. However, in truth there is a
much more subtle interaction with the source language and the compiler's optimizing capabilities
than most sources would admit. One feature of the Mips compilers that serves them in good stead
with this benchmark (and, of course, with real-life applications programs) is their ability to do routine
hoisting (see page 15). This is especially true for this benchmark, which has a high percentage of
procedure calls relative to actual computation.

To  portray  the  Mips  M/500  as  accurately  as  possible,  the  Dhrystone  benchmark  was  executed  in  C 
at  all  levels  of  optimization. 

4.3.2.  Results 

Table 4-5 shows the results obtained for the register and non-register versions of the C benchmark
at the highest level of optimization (along with the values for the Vax with the Berkeley 4.3 compiler
for comparison purposes). The larger the value for the Dhrystone benchmark, the greater the
machine/compiler performance.

Again, the benchmark results demonstrate that the Mips M/500 is a fast machine, clocking in at over
10 times faster than the MicroVax. However, a lot of the Mips speed (or actually, the VAX's lack of
speed) is attributable to the high percentage of routine calls used in this benchmark. The Berkeley
compiler uses the calls linkage exclusively, even though it is a very expensive subroutine linkage.

The most interesting aspect of this benchmark is the performance as the level of optimization is
increased. As shown in figure 4-12, as the level of optimization is increased from level 0 (i.e., no
optimization) to level 4 (the highest level), the optimizer nearly doubles the performance of the
non-register Dhrystone benchmark.

    Mips C (Register)          14184
    Mips C (Non-Register)      14167
    Vax C (Register)            1394
    Vax C (Non-Register)        1380

Table 4-5: Dhrystone Numbers for Mips and Vax

Figure 4-12: Dhrystone Benchmark Performance (horizontal axis: Optimization Level, -O0 through -O4)

It  is  also  interesting  to  note  that  the  C  compiler  is  faithfully  observing  the  benchmark's  request  to  put 
certain  variables  in  registers.  This  is  reflected  in  the  fact  that  the  register  version  of  the  benchmark 
runs  faster  than  the  non-register  version  with  low  level  optimization.  However,  when  the  optimization 
level  rises  to  level  2  (the  first  serious  set  of  optimizations  that  are  performed),  the  compiler  ignores 
the  benchmark's  requests,  and  decides  for  itself  which  variables  belong  in  registers.  The  net  effect 
is to immediately bring the non-register version of the benchmark up to par with the register version.
This effectively proves that an automatic register allocator can be just as good a judge as the
programmer (or a better one) of which variables belong in registers.
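
The two variants of the benchmark differ only in whether the source asks for registers. A minimal
sketch of that difference follows; the functions are hypothetical and are not taken from Dhrystone.
Both compute the same result, and whether the register request changes the generated code depends
on the compiler and the optimization level in use.

```c
/* Variant that requests registers explicitly, as the "register" build does. */
static long sum_with_register(const int *values, int count)
{
    register long total = 0;
    register int i;

    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Same routine with the placement decision left entirely to the compiler. */
static long sum_without_register(const int *values, int count)
{
    long total = 0;
    int i;

    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}
```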

The code size of the Mips M/500 is only 25% larger than on the MicroVAX when at optimization level
2. This sort of code size expansion, due to the more reduced instruction set complexity of the Mips
M/500, is expected. However, when optimization level 4 is used (i.e., routine hoisting), the size of the
Mips M/500 executable image is 3% smaller than that of the Vax! This is a clear example of the
desirability of a powerful compiler, and further exemplifies the applicability of a RISC architecture in
any area.

[27] It is also worthy of note that even with all optimizations turned off, the Mips M/500 is still able to
execute the Dhrystone benchmark over 5 times faster than the optimized register version on the
MicroVAX.

4.3.3.  Local  Conclusions 

Although  the  Whetstone  benchmark  leaves  some  room  for  improvement,  the  Dhrystone  benchmark 
results  show  unequivocally  that  the  concept  of  a  RISC  architecture  is  a  viable  one.  The  use  of 
routine  hoisting  is  a  very  valuable  optimization,  and  the  increase  in  performance  obtained  by 
simplifying  the  routine  interface  is  substantial. 


4.4.  The  Eight  Queens  Problem 

In order to introduce some non-artificial benchmark statistics into the test suite for the Mips machine,
the classic problem of the Eight Queens was coded. In its pure form, the Eight Queens problem
is to find a placement for eight queens on an 8 x 8 chessboard such that no piece threatens any
other in a static placement. The problem may be generalized for the placement of n queens on an
n x n chessboard. Although there are no solutions for n = 2 or n = 3, there exist solutions for n = 4
through at least n = 26. To bring execution time to a reasonable level (the complexity of the
algorithm is O(n^)), we chose n = 20 as our board size for running this benchmark.

The problem was solved using two similar algorithms, seen in the same body of code in figure 4-13.
The conditional compilation bounded by the compile time constant ABS selects which method is
used to examine the squares diagonal to the location being tested. If ABS is not defined, the
diagonals are examined directly, first the left side, and then the right. If ABS is defined, the
diagonals are examined simultaneously through the use of the abs() routine. While the latter method
shortcuts some evaluation, we would expect this version of the program to run slightly slower
because of the additional overhead of a routine call. The actual values of the run-times in this test
are unimportant, since we are not interested in using this test as a benchmark against other
processors. What is important is the relative speeds of the two algorithms at the various optimization
levels.
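
The two methods can also be compared for arbitrary board sizes. The following is a minimal sketch
under stated assumptions: place_queens and iabs are hypothetical names, the board size is a
parameter rather than the fixed 20 of the benchmark, and the choice between the two diagonal tests
is made at run time instead of by conditional compilation. The backtracking order is the same as
described above, so both methods visit the same placements and differ only in how a diagonal
collision is detected.

```c
/* Hypothetical stand-in for the absolute value routine. */
static int iabs(int i)
{
    return i < 0 ? -i : i;
}

/* Places n queens by backtracking.  On success returns 1 and leaves the
   column chosen for each row in row[0..n-1]; returns 0 if no placement
   exists.  use_abs selects the routine-call form of the diagonal test. */
static int place_queens(int n, int row[], int use_abs)
{
    int r = 0, i;

    if (n < 1)
        return 0;
    for (i = 0; i < n; i++)
        row[i] = -1;

    while (r < n) {
        if (++row[r] == n) {                /* nothing can go on this row */
            if (r == 0)
                return 0;                   /* failure - no solution */
            row[r--] = -1;                  /* reset current row and back up */
            continue;
        }
        for (i = r - 1; i >= 0; i--) {
            if (row[i] == row[r])
                break;                      /* vertical collision */
            if (use_abs) {
                if (iabs(row[r] - row[i]) == r - i)
                    break;                  /* either diagonal */
            } else {
                if (row[r] - row[i] == r - i)
                    break;                  /* left diagonal */
                if (row[i] - row[r] == r - i)
                    break;                  /* right diagonal */
            }
        }
        if (i < 0)                          /* no collisions: next row */
            r++;
    }
    return 1;
}
```

Called with n = 2 or n = 3 the sketch reports that no placement exists, and for n = 4 it returns the
columns 1, 3, 0, 2 with either method.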

As shown in figure 4-14, the direct examination of the diagonals generally runs faster than the
routine call examination. The slight anomaly at optimization level 1 can be attributed to the fact that
level 1 optimizations are quite simple, and do very little by way of flow analysis. In any case, since
optimization level 1 is billed as 'all the optimizations that can be done quickly', the optimizer cannot
be faulted for inadequately optimizing one version of the program. In fact, the only difference
between the level 0 and level 1 optimizations for this example is the removal of extraneous assembler
labels. This removal allows the assembler reorganizer to better manipulate the assembly output,[28]
and causes the major influence on program run-time to evidence itself.[29] When more substantial
optimizations are performed at level 2 (specifically, the elimination of redundant code), the direct
examination once again performs better than the routine call examination.

[28] The assembler reorganizer does not (or cannot) consider two adjacent labels as being identical.
Consequently, it is unable to move instructions around a pair of labels, and the resulting executable
image is larger.

#ifdef ABS
int abs(i) int i;
{
    /* The returned expression is illegible in the source listing;
       an absolute value is assumed here. */
    return (i < 0) ? -i : i;
}
#endif

main()
{
    register int r, i, j, low, high;
    int row[20];

    for (i = 0; i < 20; i++)
        row[i] = -1;
    r = 0;

    while (r < 20) {                    /* Main loop */
        if (++row[r] == 20)             /* Nothing can go on this row */
            if (r == 0)
                break;                  /* Failure - no solution */
            else {
                row[r--] = -1;          /* Reset current row (for later) */
                continue;               /* Back up and try again */
            }
        for (i = r-1; i >= 0; i--) {
            if (row[i] == row[r])
                break;                  /* Test vertical */
#ifdef ABS
            if (abs(row[r]-row[i]) == r-i)
                break;                  /* Test both diagonals */
#else
            if (row[r]-row[i] == r-i)
                break;                  /* Test left diagonal */
            if (row[i]-row[r] == r-i)
                break;                  /* Test right diagonal */
#endif
        }
        if (i < 0)                      /* Loop completed, no collisions */
            r++;
    }
}

Figure 4-13: Source Code to 20 Queens Problem

For this simple test, there is no difference in execution speed between optimization levels 2, 3, and 4
for the direct examination of the diagonals (the extra optimization simply has no effect on a program
that is this simply and tightly coded). However, a noticeable difference occurs at optimization level 4
for the absolute value routine-call examination of the diagonals. At this level, the optimizer hoists the
absolute value routine into the main body of code, which results in a faster run-time performance.

[29] In this case, it is the four array accesses for direct examination of the diagonals versus two array
accesses for the routine call examination of the diagonals. Without any common subexpression
elimination, this results in four multiplies and four adds for direct examination versus two multiplies,
two adds, and a routine call for the routine call examination. The latter code is faster in this case,
since multiply instructions are very expensive (see section 3.2.1).

Figure 4-14: Runtime of 20 Queens Placement at Differing Optimization Levels


The hoisted code that examines the diagonals with an absolute value routine still runs slower than
the direct examination of the diagonals because the optimizer (and the assembly reorganizer) still
have some trouble with register tracking.


#  49       if (row[r]-row[i] == r-i)       /* Test left diagonal */
        subu    $3, $18, $17
        subu    $24, $4, $2
        beq     $24, $3, $41
#  50           break;
#  51       if (row[i]-row[r] == r-i)       /* Test right diagonal */
        subu    $25, $2, $4
        beq     $25, $3, $41
#  52           break;

Figure 4-15: Direct Examination of Diagonals

As shown in figure 4-15, the direct examination of the diagonals results in 20 bytes of code being
generated. However, as figure 4-16 shows, the hoisted code uses 32 bytes of code.[30] Since the
remainder of the code generated for these two cases is identical, the extra bytes of code are directly
responsible for the performance difference.


#  46       if (abs(row[r]-row[i]) == r-i)
        subu    $2, $4, $3
        bge     $2, 0, $41
        negu    $3, $2
        b       $42
$41:
        move    $3, $2
$42:
        move    $2, $3
        subu    $24, $18, $17
        beq     $24, $3, $43
#  47           break;                      /* Test both diagonals */

Figure 4-16: Hoisted Routine Examination of Diagonals


The optimizer is having some difficulty in tracking the usage of registers 2 and 3, especially since the
move instruction immediately following the label $42 serves no purpose (since the value in register 2
is not used anywhere else in the routine). A more efficient extraction of this routine is shown in figure
4-17. In the latter case, the size of the assembly language routine is again 20 bytes of code.[31]


#  46       if (abs(row[r]-row[i]) == r-i)
        subu    $2, $4, $3
        bge     $2, 0, $41
        negu    $2, $2
$41:
        subu    $24, $18, $17
        beq     $24, $2, $43
#  47           break;                      /* Test both diagonals */

Figure 4-17: Hand Optimized Hoisted Routine Examination of Diagonals
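
The three forms of the diagonal test can be restated at the source level. The following is a minimal
sketch with hypothetical function names, not compiler output: the first function mirrors the direct
examination, the second mirrors the hoisted routine including its needless copy, and the third
mirrors the hand optimized form that negates the difference in place. The parameter dist stands for
r-i, which is always at least 1 in the program, and the three forms agree only under that assumption.

```c
/* Direct examination: left and right diagonals tested separately. */
static int on_diagonal_direct(int row_r, int row_i, int dist)
{
    if (row_r - row_i == dist)
        return 1;                   /* left diagonal */
    if (row_i - row_r == dist)
        return 1;                   /* right diagonal */
    return 0;
}

/* Hoisted absolute value routine, keeping the extra register shuffle:
   the result is built in a second variable and then copied back. */
static int on_diagonal_hoisted(int row_r, int row_i, int dist)
{
    int diff = row_r - row_i;
    int result;

    if (diff >= 0)
        result = diff;
    else
        result = -diff;
    diff = result;                  /* dead copy: diff is never read again */
    (void)diff;
    return result == dist;
}

/* Hand optimized form: negate in place, no second variable. */
static int on_diagonal_hand(int row_r, int row_i, int dist)
{
    int diff = row_r - row_i;

    if (diff < 0)
        diff = -diff;
    return diff == dist;
}
```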


[30] These code size values are somewhat misleading, since they measure the size of the assembly
code, and not the size of the actual executable image. The assembler reorganizer must occasionally
insert nop instructions into the code stream. The actual size of the code in figure 4-15 is 28 bytes,
while the size of the code in figure 4-16 is 36 bytes.

    Machine code for direct             Machine code for hoisted
    examination (28 bytes)              routine (36 bytes)

    subu    v1,s2,s1                    subu    v0,a0,v1
    subu    t8,a0,v0                    bgez    v0,0x4002a8
    beq     t8,v1,0x4002b0              move    v1,v0
    nop                                 b       0x4002a8
    subu    t9,v0,a0                    subu    v1,zero,v0
    beq     t9,v1,0x4002b0              move    v1,v0
    nop                                 subu    t8,s2,s1
                                        beq     t8,v1,0x4002c0
                                        move    v0,v1

[31] The actual executable image size is now really 28 bytes. This means that by eliminating the
needless register shuffle, the hoisted code is now slightly faster than the original inline evaluation of
the diagonals, which is what we would expect. This is due to the fact that roughly half of the time
row[r]-row[i] is positive, so not all 28 bytes of code are executed at each pass.




4.4.1. Influence of Assembler Reorganizer

Although routine hoisting is a valuable optimization, the combination of the code generator and the
assembly reorganizer has, in this case as in others, deleteriously affected the quality of the
executable image.


0x400290:  00831023  subu    v0,a0,v1
0x400294:  04410003  bgez    v0,0x4002a4
0x400298:  0251c023  subu    t8,s2,s1
0x40029c:  00021023  subu    v0,zero,v0
0x4002a0:  0251c023  subu    t8,s2,s1
0x4002a4:  13020004  beq     t8,v0,0x4002b8
0x4002a8:  00000000  nop

Figure 4-18: Machine Language Output for Hand Optimized Code in Figure 4-17


Figure 4-18 shows the actual machine instructions generated for the code in figure 4-17. Notice that
the instructions at locations 0x400298 and 0x4002a0 are identical. As outlined in section 3.2.2, the
assembly reorganizer has filled in the nop instruction that follows the bgez with the instruction that
was originally targeted by the branch (the subu instruction at 0x4002a0 calculating r-i), and has
moved the target of the branch to the next instruction that follows the original target (to 0x4002a4).
While this behavior is entirely correct, the reorganizer has missed the fact that the moved instruction
may be removed from its original location, reducing the size of and increasing the speed of the final
executable image. It could be argued that the subu instruction cannot be deleted because it
immediately follows a label. However, because the assembler knows of all the jumps and branches
that target the label, it can easily determine that the instruction is removable. In this case, only a
single branch targets that label. (See chapter 7 for further discussion on this and other reorganizer
drawbacks.)
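
The branch targets printed in the machine language listings can be checked against the encoded
instruction words. The sketch below assumes the standard MIPS conditional branch encoding, in which
the low 16 bits of the word hold a signed offset counted in instructions from the address of the
delay slot; the function name branch_target is hypothetical.

```c
#include <stdint.h>

/* Target address of a MIPS conditional branch located at address pc.
   The low 16 bits of the instruction are a signed word offset relative
   to the instruction in the delay slot (pc + 4). */
static uint32_t branch_target(uint32_t pc, uint32_t instruction)
{
    int32_t offset = (int32_t)(instruction & 0xffffu);

    if (offset & 0x8000)
        offset -= 0x10000;          /* sign-extend the 16-bit field */
    return pc + 4u + (uint32_t)(offset * 4);
}
```

For the bgez word 04410003 at 0x400294 this yields 0x4002a4, the moved branch target discussed
above.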

#  46       if (abs(row[r]-row[i]) == r-i)
        subu    $2, $4, $3
        subu    $24, $18, $17
        bge     $2, 0, $41
        negu    $2, $2
$41:
        beq     $24, $2, $43
#  47           break;                      /* Test both diagonals */

Figure 4-19: Further Modification of Hoisted Code


When the assembly language output (shown in figure 4-17) is modified again so that the calculation
of r-i is moved before the branch (effectively forcing a different reorganization strategy), the
resulting executable image is generated more intelligently, with a size of only 24 bytes (not all of
which are always executed) and a concomitant speed improvement (see figures 4-19 and 4-20).

0x400290:  00831023  subu  v0,a0,v1
0x400294:  04410002  bgez  v0,0x4002a0
0x400298:  0251c023  subu  t8,s2,s1
0x40029c:  00021023  subu  v0,zero,v0
0x4002a0:  13020004  beq   t8,v0,0x4002b4
0x4002a4:  00000000  nop

Figure 4-20: Machine Language Output of Further Optimization in Figure 4-19
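
The difference between the two listings can be checked with a small C model. This is only a sketch: it mirrors the register arithmetic of figures 4-18 and 4-20 and counts the instructions executed on each path, assuming one instruction per step and ignoring cache effects. The type `Result` and the functions `run_fig_4_18` and `run_fig_4_20` are names invented for this illustration.

```c
#include <stdint.h>

/* Register state left by either sequence, plus the number of
   instructions executed on the path actually taken. */
typedef struct {
    uint32_t v0;      /* abs(row[r] - row[i]) */
    uint32_t t8;      /* r - i */
    int executed;     /* instructions executed, including the final nop */
} Result;

static Result run(uint32_t a0, uint32_t v1, uint32_t s2, uint32_t s1,
                  int keep_redundant)
{
    Result r;
    r.v0 = a0 - v1;                          /* subu v0,a0,v1 */
    int taken = (r.v0 & 0x80000000u) == 0;   /* bgez v0,... (sign bit clear) */
    r.t8 = s2 - s1;                          /* delay slot: runs on both paths */
    r.executed = 3;
    if (!taken) {
        r.v0 = 0u - r.v0;                    /* subu v0,zero,v0 */
        r.executed++;
        if (keep_redundant) {
            r.t8 = s2 - s1;                  /* second copy at 0x4002a0 */
            r.executed++;
        }
    }
    r.executed += 2;                         /* beq and its nop delay slot */
    return r;
}

/* Figure 4-18: the copy in the delay slot and the original both remain. */
Result run_fig_4_18(uint32_t a0, uint32_t v1, uint32_t s2, uint32_t s1)
{
    return run(a0, v1, s2, s1, 1);
}

/* Figure 4-20: only the delay-slot copy remains. */
Result run_fig_4_20(uint32_t a0, uint32_t v1, uint32_t s2, uint32_t s1)
{
    return run(a0, v1, s2, s1, 0);
}
```

Both functions leave identical values in v0 and t8. Under this model the shorter sequence saves one instruction whenever the branch is not taken, and the two are equal when it is taken.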


4.4.2. Local Conclusions

The  routine  hoisting  optimization  performed  by  the  Mips  compiler  back  end  can  be  tremendously 
effective  in  reducing  the  overall  execution  time  of  programs.  However,  as  shown  in  the  simple 
example  above,  some  work  still  needs  to  be  done  on  the  register  tracking  algorithms  in  the  code 
generator,  and  in  the  expression  tracking  algorithms  in  the  assembly  reorganizer.  The  ability  to 
consider  two  adjacent  labels  as  being  identical  targets  for  jumps  and  branches  would  also  be  a 
desirable  feature. 


5. Hardware Effects on Program Performance

In  this  chapter,  we  describe  our  measurements  of  the  hardware's  interaction  with  simple  software 
constructs.  Initially,  we  set  out  to  ask  what  would  be  the  time  required  to  perform  a  routine  call. 
However,  as  our  work  progressed,  we  discovered  that  there  was  not  a  straightforward  answer  to  the 
question.  Rather,  it  was  dependent  on  the  instruction  cache  (which  on  our  version  of  the  hardware  is 
16  K  bytes)  and  how  the  host  operating  system  (Unix)  places  programs  in  virtual  memory.  In  the 
following  sections,  we  describe  the  compiler’s  interaction  with  these  features  and  interpret  our 
results. 


5.1.  Routine  Call  Overhead 

One tidbit of information about a machine and its compilers is how fast it can execute a routine call.
On the VAX, the answer depends on the type of routine linkage that is used (i.e., jsb or callx), the
number of local registers used by the routine, how many parameters it is passed, and the
instructions that are used to pass them (i.e., pushr, pushax, pushx, or movx). If one uses the callx
routine linkage, the answer is "very expensive", no matter what the other factors are. This is because,
while the callx linkage is very easy to use from an assembly language standpoint (and
also from a compiler standpoint), it is a complex instruction that incurs a great deal of overhead,
whether or not any of the special features are used.

The Mips machine has three instructions for subroutine calls: bgezal, jal, and jalr. The last two
are the most commonly used (in fact, as discussed in section 6.1.1.2, the Mips compiler suite does
not generate the bgezal instruction^^). These two instructions are relatively simple. They store the
return address in register 31 and jump to the specified address (jal is a jump to an address, while
jalr is a jump indirectly through a register). We were interested in discovering how long a simple
routine call would take, given a specified number of parameters. In our test cases, the target routine
was a dummy routine that did nothing (although the compiler still generated code to save the actual
parameters on the stack).

Our test cases were broken up into a number of classes. First, we subdivided the test programs by
the number of parameters that we would pass into a routine. This number was varied from 0 through
15 parameters. Next, we tested for the type of parameter. Since our examples were constructed
using C, we used all of the types available to the language (which corresponded nicely to the types
available in the Mips M/500 hardware): char, short, int, long, float, and double. We also
used pointers to each of these types of variables. Finally, to round out the problems, we examined
the compiler's behavior with each of the four possible variable allocation classes: local, global,
local-own (i.e., local scope declared static), and global-own (i.e., global scope declared static).
We generated 768 different programs (using an automatic program generation test bed) and
executed each one a number of times.
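
The figure of 768 programs matches a full cross product of the three dimensions just described. The report does not say that the test bed was organized this way, so the following C sketch is an inference: it simply multiplies the sizes of the three dimensions. The array and function names are invented for the illustration.

```c
#include <stddef.h>

/* The six value types named in the text; a pointer to each is also tested. */
static const char *const value_types[] = {
    "char", "short", "int", "long", "float", "double"
};

/* The four variable allocation classes named in the text. */
static const char *const alloc_classes[] = {
    "local", "global", "local-own", "global-own"
};

enum { MAX_PARAMS = 15 };   /* parameter counts run from 0 through 15 */

/* Number of programs in a full cross product of the three dimensions. */
size_t count_test_programs(void)
{
    size_t n_values  = sizeof value_types / sizeof value_types[0];
    size_t n_types   = 2 * n_values;            /* values plus pointers */
    size_t n_classes = sizeof alloc_classes / sizeof alloc_classes[0];
    size_t n_counts  = MAX_PARAMS + 1;
    return n_counts * n_types * n_classes;
}
```

With 16 parameter counts, 12 parameter types, and 4 allocation classes, the product is 768.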

Each  program  consisted  of  a  loop,  executed  2048  times,  surrounding  512  calls  to  a  routine.  We 


proposed  Mips  LISP  implementation  will  use  it. 
used a large number of routine calls so that they would substantially outweigh the overhead of
the loop. When multiple parameters were passed to a routine, the actual parameters were rotated
through the set of formal parameters to eliminate the possibility of any special optimizations that the
compiler might have for detecting common subexpressions.^
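
The shape of one such test program can be sketched in C. This is an illustrative reduction, not the generated code itself: it uses three calls per iteration instead of 512, and the dummy routine counts its calls so that the sketch can be checked, whereas the real target did nothing. The names `target`, `run_test`, and `calls_made` are hypothetical.

```c
static unsigned long calls_made;

/* Dummy target routine; the call counter exists only for this sketch. */
void target(int a, int b, int c)
{
    (void)a; (void)b; (void)c;
    calls_made++;
}

/* One test program: a loop surrounding a block of calls in which the
   actual parameters are rotated through the formal parameter positions. */
unsigned long run_test(unsigned long iterations)
{
    int x = 1, y = 2, z = 3;
    calls_made = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        target(x, y, z);
        target(y, z, x);
        target(z, x, y);
    }
    return calls_made;
}
```

With the report's figures of 2048 iterations and 512 calls per iteration, each run makes 1,048,576 calls, so the loop overhead is a small fraction of the total.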

Initially,  our  study  discovered  the  following  items: 

1. The first four parameters to a routine are passed in registers 4 through 7 (or registers
a0 through a3; see table A-1). The remaining parameters are passed on the stack
(the reorganizer has an interesting part to play in this convention; see figures 5-1 and
5-2). Passing 4 parameters in registers is wise. Most routines are called with 3
parameters or fewer.^^

2. All integer data types (i.e., char, short, int, and long) took the same amount of
time to pass as parameters to the test routine. This is because the lb, lh, and lw
instructions all execute in a single cycle (plus a single delay slot).

3. All address data types (i.e., a pointer to any of char, short, int, long, float, or
double) took the same amount of time to pass as parameters to the test routine. This
is because all addresses are loaded using the la instruction.

4. Passing floating-point parameters took longer than passing integer parameters. This is
due to the interactions and synchronization between the Mips M/500 CPU and the
floating-point co-processor. Although no nop instructions are in evidence in the object
code, there are implicit delays whenever data is passed from one processor to another.

5. Double precision floating-point parameters took less time to pass than single precision
floating-point parameters. This was an artifact of the C language calling convention,
which requires that single precision numbers be converted to double precision in
routine calls. This effect is also discussed in section 4.2.2.1.

6. Passing local variables took less time than passing global or statically allocated
variables. Local variables are usually stored in registers, and passing them as parameters
requires a register move (i.e., 1 CPU cycle). Global and statically allocated variables
are stored in main memory and must be loaded into a register (i.e., 1 CPU cycle plus a
delay slot). Multiple loads can be overlapped, but the last load requires one extra
cycle to fill the delay slot.®®

7. In our test cases, passing an address as a parameter took less time than passing a
value. This result is misleading, though, since the optimizer was able to recognize the
addresses we were passing as common sub-expressions and translate that knowledge
into a reduced complexity program. In actual practice, passing an address takes the


*^When the number of parameters was 15, the optimizer used over 14 M bytes of memory while trying to optimize the code.
This resulted in literally millions of page faults for each separate compilation, and a flurry of complaints directed towards Mips
Inc. When main memory was increased from 4 Mb to 6 Mb, the number of page faults (and the compilation time) decreased
markedly. However, for a compiler to use 14 Mb of data space to optimize 35 Kb of code is uncalled for. This translates to a
data expansion of 400:1, or over 26 Kb of optimizer memory for each line of source code. We admit that this example is an
unusual one, and that typical optimizer memory usage is not this high. However, this is one example that we hold in disfavor
when evaluating the Mips compiler suite.


**Of the nearly 1000 individual routines declared in the three integer applications in section 6.3.1, only half a dozen (less
than 1%) of them had more than 4 formal parameters. The vast majority of them had 2 or fewer parameters. This finding
closely correlates with the results in [Cook 82], [DePrycker 82], [Tannenbaum 78], and [Zeigler 83], who report 0.9, 2.1,
1.5/2.0, and 1.3 average parameters per routine, respectively.


**The delay slot could be filled with the jal instruction, but then the delay slot for that instruction could not be filled. See
section 3 for more information on delay slots.
same amount of time as (for statically allocated variables) or longer (for register
variables) than passing a value parameter. Compare the expansion of the la instruction
on page 146 with that of the lw instruction on page 148 and recall that, to
construct the address of a register variable, the value of that variable must first be stored
in a memory location on the stack, requiring an extra sw instruction.
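
The promotion described in item 5 can still be observed in C. In the pre-ANSI C of the report's era, every float argument was widened to double. In ISO C the same default argument promotion applies to unprototyped calls and to the variable part of a variadic argument list, which is what this sketch relies on. The function name `sum_promoted` is invented for the illustration.

```c
#include <stdarg.h>

/* Sums `count` floating-point arguments. The caller may pass float
   values; by the default argument promotions they arrive as double,
   so they must be fetched with va_arg(ap, double). */
double sum_promoted(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);   /* float arguments arrive widened */
    va_end(ap);
    return total;
}
```

A call such as `sum_promoted(2, 1.5f, 2.25f)` therefore converts both single precision values to double before the call, which is the extra work the measurements in item 5 reflect.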

We  discuss  a  number  of  other  interesting  phenomena  in  the  following  sections. 


5.2.  Reorganizer  Effects  on  Parameter  Passing 

When passing local parameters to a routine, the Mips compilers generate move or li instructions to
move the first four parameters into the argument registers, and store instructions to push the
remaining parameters on the stack. Thus, one would expect there to be two breakpoints in the time versus
number of parameters graph - between 0 and 1 parameter, and between 4 and 5 parameters.
However, as shown in figure 5-1, it takes the same amount of time to call a routine with one
integer parameter as it does to call it with none.^®


Figure 5-1: Routine Overhead for Local Integer Parameters

The predicted breakpoint in the curve occurs between 4 and 5 parameters. Yet 1 parameter takes
no longer to pass than zero. The reason for this lies in the assembler reorganizer. A simple routine
with no parameters is called with a jal instruction, which, according to the Mips M/500 hardware
constraints, has a single delay slot following it. Thus, a simple call takes two CPU cycles to execute.


’*The high-low bars on the graph indicate the maximum variance between different runs of the test programs, while the line
indicates the average. We are concentrating on the average time now, and will discuss the variance in section 5.3. The
actual times are irrelevant, since our test programs are contrived examples and do not represent real-life examples.
However, a local value class parameter is passed by executing a move into argument register a0,
and this instruction can be moved into the delay slot of the jal.

Thus, a routine call with a single parameter takes no longer than a call with none. Both require 2
CPU cycles to execute, but with no parameters the second cycle is spent executing a nop, while
with one parameter it is spent executing a move.
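
The reasoning above amounts to a simple instruction count, which can be sketched as a C function. The model is an assumption, not a measurement: it counts one cycle per issued instruction, covers only the register-passed range of 0 through 4 local parameters, and supposes that the reorganizer always places the last move in the jal delay slot. The function name `local_call_cycles` is hypothetical.

```c
/* Issue cycles for a call passing n local (register-resident) integer
   parameters, for 0 <= n <= 4. Returns -1 outside that range, where
   stack stores come into play and this simple model no longer applies. */
int local_call_cycles(int n)
{
    if (n < 0 || n > 4)
        return -1;

    int moves = n;                   /* one move per parameter */
    int slot_filled = moves > 0;     /* last move sits in the jal delay slot */

    /* moves + jal, plus a nop only when nothing fills the delay slot */
    return moves + 1 + (slot_filled ? 0 : 1);
}
```

Under these assumptions the function gives 2 cycles for both zero and one parameter, and one further cycle for each additional parameter up to four.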

When global (or statically allocated) variables are passed as parameters, another discrepancy
between predicted and actual results occurs. As shown in figure 5-2, the second breakpoint in the
curve occurs between 5 and 6 parameters, not between 4 and 5 parameters as predicted.


Figure 5-2: Routine Overhead for Global Integer Parameters

The reason for this effect is similar to that seen in figure 5-1, except that in this case, the effect is
delayed. The first four global value parameters are loaded into registers with the lb, lh, or lw
instructions. Each load instruction takes one cycle plus one delay slot. However, each delay slot
except the last is filled with the next load instruction, and the delay slot for the last load instruction is
filled by the jal instruction (the delay slot of the jal instruction is then of necessity a nop
instruction). Thus, when there are from 0 through 4 global value parameters passed into a routine,
each extra parameter requires one extra instruction cycle to pass it. Note that the jal delay slot
cannot be filled by a parameter load (which has its own delay slot), since the first instruction of the
called routine might access that parameter.

Unlike the first four parameters, which are simply loaded into registers, the fifth and following
parameters must also be pushed onto the stack with an sw instruction. Thus each extra parameter
past 4 requires two instructions to pass it, and the slope of the curve for these parameters should be
twice what it is for the first 4 parameters. However, when we examine the object code for 5
parameters, we notice that the delay slot for the jal is filled with the sw for the fifth parameter.
Since this delay slot was previously a nop instruction, passing a fifth parameter has effectively taken
only a single instruction more than passing four parameters (even though two extra instructions are
actually executed). When the sixth parameter is passed, two instructions are required, and since the
delay slot of the jal is already filled, twice the work needs to be done to load this and subsequent
parameters.
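
The shifted breakpoint follows from the same kind of instruction count. The C sketch below is a model under stated assumptions: one cycle per issued instruction, every load delay slot filled as the text describes, and no cache or memory effects. The function name `global_call_cycles` is invented for the illustration.

```c
/* Issue cycles for a call passing n global integer parameters.
   Each parameter needs a load; each parameter past the fourth also
   needs a store to the stack. The jal delay slot holds a nop unless
   a store is available to fill it. Returns -1 for negative n. */
int global_call_cycles(int n)
{
    if (n < 0)
        return -1;

    int loads  = n;
    int stores = n > 4 ? n - 4 : 0;
    int nop    = stores == 0 ? 1 : 0;   /* an sw fills the slot when present */

    return loads + stores + 1 + nop;    /* the 1 is the jal itself */
}
```

The model yields 6 cycles for four parameters, 7 for five, and 9 for six, so the slope doubles between 5 and 6 parameters rather than between 4 and 5.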

Thus, the assembly reorganization needed to satisfy the Mips M/500 pipeline actually benefits
subroutine parameter passing by delaying the effects of adding extra parameters to subroutines. In
general, these benefits will manifest themselves regardless of the type of parameter that is passed,
since the benefits are derived for both local and global value parameters, and at the low and high
end of the number of variables.


5.3.  Effects  of  Instruction  Caching 

As we said in the previous section, figures 5-1 and 5-2 show not only the average run-time for
various procedure call overheads, but also the variance across different runs of the same program.
The fact that the run-times varied at all was discovered accidentally. We ran each program 6 times
and generated a graph from the results. Because there were some wild aberrations in the graph,
we re-ran the tests, and got a very different graph with different abnormalities.

At first we thought that the discrepancies were due to glitches in the CPU time accounting of the
Mips Unix system. We rejected this idea when we observed the following:

1. Successive runs of the same program gave almost perfectly consistent run times, but if
other programs were run in between tests, the CPU time varied.

2. The actual amounts of CPU time required to run a given test case did not fluctuate in a
continuous spectrum, but fell into a limited set of quanta.

3. The run-times of the very large and very small test cases did not vary much, but
run-times of the medium sized test cases varied considerably.

Figure 5-3 shows the distribution of the run-times needed by the various programs. Each curve on
the graph represents a different number of parameters being fed to the test routine. The x axis is the
CPU time required to execute the program, and the y axis is the frequency of occurrence of that time
value.

Notice  that  the  individual  programs,  when  run  at  different  times,  exhibit  different  run-times,  and  that 
these  run-times  fall  into  well  defined  quantile  points.  There  are  three  reasons  for  this: 

1. The Mips M/500 has a 16 K byte hardware instruction cache. Programs smaller than
one page (i.e., 4 K bytes) will reside either wholly within the cache or wholly outside of
it. Programs larger than one page but smaller than 4 pages will reside wholly or partially
in the cache, or they may not be in the cache at all. Programs larger than 16 K bytes
will have some variable percentage of their code in the instruction cache. The fraction
of a program that is in the cache will, to a large degree, determine how fast that
program runs.

2. Whether a page resides in the instruction cache depends on a number of factors, one
of which is the physical address of the start of the page. On any single run of a
program, the Unix operating system will place successive pages of a program image
into the first available pages from the memory pool. These pages are not necessarily
contiguous, so there may be collisions between two or more pages in a program in the
instruction cache hash table.®^ The fewer collisions there are within a given program,
the more effective the cache is at speeding up the execution of that program.

Figure 5-3: Distribution of Execution Times for Similar Programs

3. The file access mechanism on Unix relies on an i-node which points at the pages of a
program when it is on disk, and also serves to reference the program's pages when it
is in main memory. Once a program has been executed, it remains in memory (even if
it is not presently being executed) until its pages need to be reclaimed. If a program is
run a number of times in succession, the pages that comprise the program will remain
at the same virtual and physical address across each run. If, however, other programs
are run in between executions, then the pages for the program may be reclaimed, and
on subsequent execution may have to be reloaded from disk at potentially different
addresses in memory. Additionally, if a copy of the program is made (effectively
creating a new i-node), then the program (now referenced by a different, non-resident
i-node) must be loaded into memory, probably at different page addresses. Each time
the addresses of the pages of a program change, the program may run at a different
speed, due to the reasons cited in the first two items.
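
The page collisions described in item 2 can be illustrated with a short C sketch. It assumes a direct-mapped, physically indexed cache, so that a 16 K byte cache holds four 4 K byte page-sized slots and a page frame maps to the slot given by its frame number modulo 4. The report does not describe the cache indexing in this detail, so the mapping is an assumption, and the name `count_collisions` is invented for the illustration.

```c
#include <stddef.h>

enum {
    CACHE_BYTES = 16 * 1024,                 /* instruction cache size */
    PAGE_BYTES  = 4 * 1024,                  /* page size */
    SLOTS       = CACHE_BYTES / PAGE_BYTES   /* page-sized cache slots */
};

/* Given the physical frame numbers of a program's pages, count how many
   pages map to a cache slot already claimed by an earlier page. */
int count_collisions(const unsigned long *frames, size_t n)
{
    int used[SLOTS] = {0};
    int collisions = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned slot = (unsigned)(frames[i] % SLOTS);
        if (used[slot])
            collisions++;        /* shares a slot with an earlier page */
        else
            used[slot] = 1;
    }
    return collisions;
}
```

A program loaded into frames 8, 9, 10, and 11 has no collisions under this mapping, while the same program loaded into frames 8, 12, and 9 has one, because frames 8 and 12 compete for the same slot. This is why the same executable can run at different speeds after being reloaded at different addresses.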

The net effect of all of this is that, in general, no single number can be quoted as the "run-time" of a
program. We can in general speak of best case, worst case, or average run-times. However, for
any processor that has an instruction cache, the hit rate on the cache is determined by a number of
factors - all of which are out of the control of the user in the case of the Mips M/500 running Unix.
Ergo, the actual run-time of a program is non-deterministic and unpredictable, although the range of
values in which the run-time will fall is predictable. Figure 5-4 shows the ranges of execution times
for a set of similar programs. In this case, the program that was used was one of our routine-call test
cases, except that here we simply varied the size of the loop being executed, rather than varying the
number of parameters.


Figure 5-4: Variance of Execution Times of Similar Programs

In figure 5-4, the solid line represents the average execution time as the program size increases.
The high-low bars indicate the minimum and maximum execution times, and the dotted lines
represent the extrapolation of the extremes. The distance between the extremes is probably highly
correlated to the size of the instruction cache, but we cannot verify this prediction because we could not
change the hardware cache size. We do know, however, that without an instruction cache, the
average execution time would probably be at the same level as the extrapolated maximum, with very
little variation between runs.

In  figure  5-4,  the  average  run-time  is  close  to  the  minimum  for  small  programs  and  climbs  toward  the 
maximum  for  larger  programs.  The  larger  the  program,  the  less  likely  it  will  be  wholly  cache  resident. 
Large  programs  (i.e.,  larger  than  16  Kb)  will  never  be  wholly  cache  resident,  although  they  can  still 
reap  the  benefits  of  an  instruction  cache  by  grouping  like  procedures  together  (Mips  has  a  program 
called  cord  to  aid  in  this  process).  Further  benefits  could  be  derived  with  a  more  robust  linker. 

Floating-point programs exhibit a much smaller range between the minimum and maximum values.
We suspect that this is due to the fact that although instructions for the floating-point co-processor
may be kept in the instruction cache, they are executed in the co-processor - effectively obviating
the cache.^ We feel (although we have not tested this hypothesis) that programs with a greater
percentage of floating-point instructions will demonstrate a reduced variability in execution speeds.
Of course, regardless of program content, the larger the program, the lower the variability when the
program counter is not kept within a 1 page boundary.


6. Instruction Set Usage by the Compilers

This chapter covers a three part analysis of the use of the machine instruction set by the Mips
compilers. The first part is a static analysis in which we examined the source code for the compilers
to determine what instructions could possibly be generated from a source program. The purpose of
this test was to get a feel for the utility of the instructions in the instruction set. If an instruction is
never used by the compiler, then perhaps it is because it is too difficult for the compiler to detect a use
for the instruction (or perhaps it is a special instruction that was never expected to be used, such as
the translation lookaside buffer instructions on the Mips M/500, or the context switch instructions on
the VAX).

The second part is a thorough analysis of a specific compiler written for the Mips. The compiler is for
BCPL, a simple, easy-to-implement systems programming language.^® The purpose of this exercise
was to get an instrumented view of the instruction set in relation to a known compiler, and to
evaluate patterns of register use, instruction use, instruction mix, addressing mode use, and addressing
mode effectiveness.

The third part is an analysis of instruction and register use across a number of large programs. In
contrast to the static analysis, this "brute force" overview examines the actual instructions and
registers that are used for a set of programs. This analysis does not provide specific insights; however,
we can make some general statements about the compilers' effectiveness and efficiency.

In all three sections we compare the Mips compilers with the Vax Berkeley Unix compilers.
The purpose of the comparison is to:

•  Give some feel for the use of a RISC versus a CISC architecture from a compiler
standpoint. We hope to quantify our assertion that many instructions in a CISC
architecture are not used by the compiler, and thus show that a reduced instruction set is
reasonable.

•  Provide a basis of comparison that most of our readers will be familiar with.

•  Highlight the differences between optimizing compilers (on the Mips) and less
sophisticated compilers (on the Vax).

This  information  is  provided  to  deliver  insights,  not  tables  of  raw  figures. 


6.1.  Static  Analysis  of  Compilers 

This section examines the set of instructions that the compiler can generate (but not necessarily
those instructions that it will generate). We collected this information by reading through the source
code of the compilers. Through this exercise we hope to shed some light on two aspects of compiler
and processor technology:

1. What subset of the instruction set can be effectively used by a compiler (and from this
information, what an effective minimum instruction set is).


2. What subset of the instruction set cannot be used by a compiler (and from this, which
instructions are too specialized or too complex to be effectively fitted to a source code
idiom).


^BCPL is one of the ancestors of the C language.

To adequately address these questions, we looked at the compilers for both the Mips and the Vax,
the latter being included in our investigation as a CISC architecture, and hence a possible
counterexample to our pro-RISC argument. As shall be seen, we show conclusively that a RISC
architecture is a much better choice from a compiler standpoint.

6.1.1.  Mips  C,  Fortran,  and  Pascal  Compilers 

The Mips compiler suite currently consists of three different language front end parsers (C, Fortran,
and Pascal) and a common optimizer and code generator. To analyze the use of the Mips
instruction set by the compilers, we were forced to look at the Mips compilers from two levels - the
high-level instruction set use generated by the compiler, and the low-level instruction set executed by the
Mips M/500 hardware. Ultimately, only the low-level instructions get executed, so the most
significant tables are in section 6.1.1.2. However, comparing the low-level coverage with the high-level
coverage argues in favor of a reduced instruction set. In spite of the large number of conditional
instructions provided by the high-level assembler (26 set and branch), all of the instructions are
easily emulated with less than one third as many real instructions (8 branch and set, plus xor).
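
How a few real instructions can stand in for many high-level ones can be sketched in C. The expansions below are the conventional ones for a MIPS-style assembler (a set-on-less-than followed by an xor, or an xor followed by an unsigned comparison). The report does not list the exact sequences the Mips assembler emits, so these are assumptions, and the C function names merely mirror the mnemonics.

```c
#include <stdint.h>

/* The two comparison primitives assumed to exist in hardware. */
static uint32_t slt(int32_t a, int32_t b)    { return a < b; }
static uint32_t sltu(uint32_t a, uint32_t b) { return a < b; }

/* High-level "set" instructions built from the primitives plus xor. */
uint32_t sgt(int32_t a, int32_t b) { return slt(b, a); }        /* swap operands */
uint32_t sge(int32_t a, int32_t b) { return slt(a, b) ^ 1u; }   /* not (a < b) */
uint32_t sle(int32_t a, int32_t b) { return slt(b, a) ^ 1u; }   /* not (b < a) */

uint32_t seq(int32_t a, int32_t b)
{
    /* xor is zero only when the operands match; (x <u 1) tests for zero */
    return sltu((uint32_t)a ^ (uint32_t)b, 1u);
}

uint32_t sne(int32_t a, int32_t b)
{
    /* (0 <u x) tests for non-zero */
    return sltu(0u, (uint32_t)a ^ (uint32_t)b);
}
```

Each high-level instruction costs at most two real instructions in this scheme, which is the trade the paragraph above describes: a larger assembler vocabulary at no cost in hardware.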

6.1.1.1. Mips High-Level Instruction Use

The following table lists the full (high-level) instruction set of the Mips architecture. The Mips
compilers use many of these instructions. If an instruction is used by the compiler,^® it is shown in
boldface. Wherever justifiable, instructions that are not generated by the compiler/optimizer are
shown in plain text. Instructions that are unjustifiably ignored by the compiler are shown in
(italics). Superscripted numbers refer to notes at the end of the table.


abs           add           addu          and           b             bal^1
bc0f^2        bc0t^2        bc1f          bc1t          bc2f^2        bc2t^2
bc3f^2        bc3t^2        beq           beqz^4        bge           bgeu
bgez^4        (bgezal)      bgt           bgtu          bgtz^4        ble
bleu          blez^4        blt           bltu          bltz^4        (bltzal)
bne           bnez^4        break         c0^2          c1^3          c2^2
c3^2          cfc0^2        cfc1^3        cfc2^2        cfc3^2        ctc0^2
ctc1^3        ctc2^2        ctc3^2        div           divu          j
jal           la            lb            lbu           ld            lh
lhu           li            lui^5         lw            lwc0^2        lwc1^?
lwc2^2        lwc3^2        lwl^6         lwr^6         mfc0^2        mfc1
mfc1.d        mfc2^2        mfc3^2        (mfhi)        (mflo)        move
mtc0^2        mtc1          mtc1.d        mtc2^2        mtc3^2        (mthi)
(mtlo)        mul           (mulo)        (mulou)       (mult)        (multu)
(neg)         negu          nop           nor^8         not           or

^In this case, by "compiler" we mean the combination of the language-specific front end and the common back end. In
analyzing which instructions are generated, we examined only the back end (i.e., the code generator) and assumed that if
there was code in the back end, some compiler would support it.
rem           remu          rfe^7         rol^9         ror^9         sb
sd            seq           sge           sgeu          sgt           sgtu
sh            sle           sleu          sll           slt           sltu
sne           sra           srl           (sub)         subu          sw
swc0^2        swc1^?        swc2^2        swc3^2        swl^6         swr^6
syscall^7     tlbp^10       tlbr^10       tlbwi^10      tlbwr^10      (ulh)
(ulhu)        (ulw)         (ush)         (usw)         xor

abs.d         abs.s         add.d         add.s         c.eq.d        c.eq.s
c.f.d^11      c.f.s^11      c.le.d        c.le.s        c.lt.d        c.lt.s
c.nge.d^11    c.nge.s^11    c.ngl.d^11    c.ngl.s^11    c.ngle.d^11   c.ngle.s^11
c.ngt.d^11    c.ngt.s^11    c.ole.d^11    c.ole.s^11    c.olt.d^11    c.olt.s^11
c.seq.d^11    c.seq.s^11    c.sf.d^11     c.sf.s^11     c.ueq.d^11    c.ueq.s^11
c.ule.d^11    c.ule.s^11    c.ult.d^11    c.ult.s^11    c.un.d^11     c.un.s^11
cvt.d.s       cvt.d.w       cvt.s.d       cvt.s.w       cvt.w.d       cvt.w.s
div.d         div.s         l.d           l.s           mov.d         mov.s
mul.d         mul.s         neg.d         neg.s         round.w.d     round.w.s
s.d           s.s           sub.d         sub.s         trunc.w.d     trunc.w.s

Notes 


1. The Mips compilers suffer from a common problem. The bal instruction is not used
because the compiler has no facility for determining at compile time the address of the
target, and hence no knowledge of whether the target will be out of range of a branch.
The target will usually be in range when recursion is used, although the Mips compiler
does not take advantage of this knowledge.

2. The Mips M/500 provides instruction set support for 4 co-processors. However, only
co-processor 1 (the floating point co-processor) is presently supported in hardware. Of
course, the extra co-processor instructions will not be generated for non-existent
hardware.

3. Certain co-processor instructions do not make any sense for the floating point
co-processor, since their functions are not supported by a floating point unit.

4. Although instructions are provided in the high-level assembly language for conditional
branches relative to zero, the compiler simply generates an ordinary conditional branch
relative to the zero register. In the end, these instructions are functionally equivalent.

5. The lui instruction is available to the high-level assembler, but it is not really needed.
It is used primarily to load an immediate value larger than 16 bits on the real Mips
M/500 hardware (while the high-level assembler allows a full 32 bit operand).

6. These special load instructions could conceivably be used in C to load structure
components stored in registers, but their primary function is to be used in the unaligned
load and store instructions.

7. The syscall and rfe instructions are used to perform system calls, a function
handled by the subroutine libraries.

8. None of the high-level languages on the Mips has a nor function, hence the nor
instruction is not used.

9. None of the high-level languages on the Mips has a rotate function, hence the rol and
ror instructions are not used.
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10.  These  instructions  reference  the  translation  lookaside  buffer  and  are  used  primarily  in 
the  kernel,  and  then  only  in  assembly  language. 

1 1 .  These  instructions  are  provided  by  the  floating-point  co-processor  to  supply  complete 
IEEE  floating-point  compatibility.  They  are  not  all  necessary  for  the  languages  avail¬ 
able  on  the  Mips. 
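
Note 5 above observes that lui matters only on the real hardware, where an immediate operand is limited to 16 bits. The following is a minimal sketch of how a 32-bit constant can be split into the two halves consumed by a lui/ori instruction pair. The function names are hypothetical, and the choice of ori for the low half is an assumption made for illustration (a sequence that uses a sign-extending add for the low half would need an adjusted upper half).

```python
def split_for_lui_ori(value):
    """Split a 32-bit constant into (upper, lower) 16-bit halves.

    upper is the operand for lui (fills bits 31..16);
    lower is the operand for ori (fills bits 15..0).
    """
    value &= 0xFFFFFFFF
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    return upper, lower


def needs_lui(value):
    """True when the constant cannot be loaded with one 16-bit immediate.

    A single instruction can supply either a sign-extended 16-bit value
    (-32768..32767) or a zero-extended one (0..65535).
    """
    return not (-0x8000 <= value <= 0xFFFF)


if __name__ == "__main__":
    hi, lo = split_for_lui_ori(0x12345678)
    print(hex(hi), hex(lo), needs_lui(0x12345678))
```

Recombining the halves as `(upper << 16) | lower` gives back the original constant, which is the property the two-instruction sequence relies on.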

6.1.1.2. Mips M/500 Low Level Instruction Use

The following table lists the full (native) instruction set of the Mips M/500 architecture. Since the
Mips compilers do not generate these instructions directly, but rely on the assembler reorganizer, it is
only partially true that the compilers use these instructions. If an instruction is used by the
compiler,^ it is shown in boldface. Instructions that are justifiably not generated by the
compiler/optimizer are shown in plain text. Instructions that are unjustifiably ignored by the
compiler are shown in (italics). The superscripted numbers refer to notes at the end of the table.


add 

addi 

addiu 

addu 

and 

andi 

b 

bc0f¹

bc0t¹

bclf^ 

bclt^ 

beq 

bgez 

(bgezal) 

bgtz 

blez 

bltz 

(bltzal) 

bne 

break 

c0¹

c2¹

c3¹

cfc1

ctc1

div 

divu 

j 

jal 

jalr 

lb 

lbu

lh

lhu

li 

lui 

lw

lwc0¹

lwc1

lwc2¹

lwc3¹

lwl⁵

lwr⁵

mfc0¹

mfc1

mfhi 

mflo 

move 

mtc0¹

mtc1

mthi³

mtlo³

mult 

multu 

nop 

nor 

or

ori 

sb 

sh 

sll 

sllv 

slt

slti 

sltiu 

sltu

sra 

srav 

srl

srlv 

sub 

subu 

sw 

swc0¹

swc1

swc2¹

swc3¹

swl⁵

swr⁵

syscall⁴

xor 

xori 

abs.d      abs.s      add.d      add.s      c.eq.d      c.eq.s

c.f.d⁶     c.f.s⁶     c.le.d     c.le.s     c.lt.d      c.lt.s

c.nge.d⁶   c.nge.s⁶   c.ngl.d⁶   c.ngl.s⁶   c.ngle.d⁶   c.ngle.s⁶

c.ngt.d⁶   c.ngt.s⁶   c.ole.d⁶   c.ole.s⁶   c.olt.d⁶    c.olt.s⁶

c.seq.d⁶   c.seq.s⁶   c.sf.d⁶    c.sf.s⁶    c.ueq.d⁶    c.ueq.s⁶

c.ule.d⁶   c.ule.s⁶   c.ult.d⁶   c.ult.s⁶   c.un.d⁶     c.un.s

cvt.d.s    cvt.d.w    cvt.s.d    cvt.s.w    cvt.w.d     cvt.w.s

div.d      div.s      mov.d      mov.s      mul.d       mul.s

neg.d      neg.s      sub.d      sub.s

Notes

1. The Mips M/500 provides instruction set support for 4 co-processors. However, only
co-processor 1 (the floating-point co-processor) is presently supported in hardware. Of
course, the extra co-processor instructions will not be generated for non-existent
hardware.

2. Certain co-processor instructions do not make any sense for the floating-point
co-processor, since their functions are not supported by a floating-point unit.

3. The hi (lo) registers are documented to "hold the most (least) significant 32 bits of
multiply, quotient, or divide." Since these are result registers, we do not expect that a
compiler would have any reason to load them explicitly.

4. The syscall instruction is used to perform system calls, a function handled by the
subroutine libraries.

5. These special load instructions could conceivably be used in C to load structure
components stored in registers, but their primary function is to be used in the unaligned
load and store instructions, none of which are generated by the compiler.

6. These instructions are provided by the floating-point co-processor to supply complete
IEEE floating-point compatibility. They are not all necessary for the languages
available on the Mips.

^In this case, by 'compiler' we mean the combination of the language-specific front end, the common back end, and the
assembler reorganizer.

6.1.2.  Berkeley  C  and  Fortran  Compilers 

By way of comparison, we examined the Berkeley C and Fortran compilers and the way they use
the VAX assembly instruction suite. The following table lists the full instruction set of the Vax
architecture. The Berkeley C and Fortran compilers use many of these instructions. If an instruction
is used by the compiler,^ it is shown in boldface. Instructions that are justifiably not
generated by the compiler/optimizer are shown in plain text. Instructions that are unjustifiably
ignored by the compiler are shown in (italics). Superscripted numbers refer to notes at the end
of the table.

(acbb) 

(acbd) 

(acbf) 

acbg’® 

acbh’® 

acbl^ 

(acbw) 

adawi*® 

addb2^ 

(addb3) 

addd2 

addd3 

addf2 

addf3

addg2’‘’ 

addg3 

addh2’‘’ 

addh3*® 

addl2 

addl3 

addp4 

addp6^^ 

addw2^ 

(addw3) 

(adwc) 

aobleq^ 

aoblss^ 

ashl 

ashp’^ 

(ashq)

bbc^ 

(bbcc) 

bbcci^^ 

(bbcs) 

bbs^ 

bbsc^ 

(bbss) 

bbssi*^ 

(bcc) 

(bcs)

beql’ 

b«qlu^ 

bgeq’ 

bgequ’ 

bgtr^ 

bgtru^ 

bicb2^ 

(bicb3)

bicl2 

bicl3 

(bicpsw) 

bicw2^ 

(bicw3)

bisb2^ 

(bisb3) 

bisl2 

bisl3

(bispsw) 

bi8w2^ 

(bisw3) 

bitb 

bitl

bitw 

blbc’ 

blbs^ 

bleq^ 

blequ^ 

blss’ 

blssu* 

bneq’ 

bnequ^ 

bpt® 

(brb) 

brw

(bsbb) 

(bsbw) 

bugl® 

bugw® 

(bvc) 

(bvs) 

(callg) 

calls 

(caseb) 

casel

(casew) 

chme^ 

chmk^ 

chms^ 

chitiu^ 

clrb 

clrd 

clrf

clrg'O 

clrh’<' 

clrl 

(clro) 

(clrq) 

clrw

cmpb

cmpc3^^ 

cmpc5

cmpd

cmpf

cmpg¹⁰

cmph¹⁰

cmpl

cmpp3 

cmpp4^^ 

(cmpv)

cmpw

(cmpzv) 

crc 

cvtbd 

cvtbf 

cvtbg^® 

^In this case, by 'compiler' we mean the combination of the code generator and optimizer, since the Berkeley compiler
suite splits these two tasks into two separate programs (which, instead of operating on a common intermediate form, share
information in assembler source code format). Some instructions are therefore not generated directly by the compiler, but are
inserted by the optimizer to match certain code idioms. The origin of the instruction is unimportant. Rather, it is more
important that it is used at all. The C and Fortran compilers differ only in the front end; the code generator is shared by both
languages.

cvtbh¹⁰

cvtbl 

cvtbw 

cvtdb 

cvtdf 

cvtdh**^ 

cvtdl 

cvtdw 

cvtfb 

cvtfd 

cvtfg*® 

cvtfh*° 

cvtfl 

cvtfw 

cvtgb^® 

cvtgf 

cvtgh^® 

cvtgl^® 

cvtgw^^ 

cvthb*® 

cvthd^® 

cvthf’° 

cvthg*® 

cvthl 

cvthw¹⁰

cvtlb 

cvtld 

cvtlf 

cvtlg*® 

cvtlh’° 

cvtlp’* 

cvtlw 

cvtpl” 

cvtps

cvtpt  ^  ^ 

(cvtrdl) 

(cvtrfl ) 

cvtrgl^® 

cvtrhl’® 

cvtsp¹¹

cvttp’^ 

cvtwb 

cvtwd 

cvtwf 

cvtwg^^ 

cvtwh*® 

cvtwl 

decb

decl 

decw 

divb2-* 

(divb3) 

divd2 

divd3

divf2

divf3 

divg2’° 

divg3^® 

divh2^° 

divh3'® 

divl2 

divl3 

divp” 

divw2'* 

(divw3) 

editpc*^ 

(ediv) 

(emodd) 

(emodf) 

emodg 

emodh^^ 

(emul) 

escd

esce

escf

extv 

extzv 

ife^* 

halt® 

incb 

incl 

incw

index 

insqhi^'^ 

.  Id 

insqti

insque*'^ 

insv 

jbe’”^ 

(jbcc) 

(jbcs)

jbr^ 

jbs^-^ 

jbsc® 

(jbss) 

jeql’ 

jeqlu’ 

jg«q’ 

jgequ’ 

jgtr’ 

jgtru’ 

jlbc’-^ 

jlbs’-^ 

jleq’ 

jlequ’ 

jlss^ 

jlssu* 

jnq)’ 

jneq^ 

jnequ^ 

jab® 

Idpctx® 

locc^^ 

matchc^'^ 

(mcomb) 

mcoml 

(mcomw) 

mfpr® 

mnegb 

mnegd 

mnegf 

innegg^® 

innegh*® 

mnegl 

mnegw

movab

(movad)

(movaf)

movag¹⁰

movah¹⁰

moval

(movao) 

movaq^ 

movaw^ 

movb

movc3⁶

movc5

movd

movf

movg¹⁰

movh¹⁰

movl

(movo) 

movp^^ 

(movpsl) 

movq

movtc^^ 

movtuc^^ 

movw 

movzbl 

movzbw 

movzwl 

mtpr® 

mulb2^ 

(mulb3) 

muld2

muld3

mulf2

mulf3

mulg2¹⁰

mulg3

mulh2¹⁰

mulh3

mull2

mull3

mulp¹¹

mulw2^ 

(mulw3) 

nop 

polyd*^ 

polyf 

polyg^®'^^ 

polyh^®’’^ 

(popr)

prober® 

probew® 

pushab^ 

(pushad)

(pushaf)

pushag¹⁰

pushah¹⁰

pushal^ 

(pushao) 

(pushaq)

(pushaw)

pushl

(pushr)

rei* 

remqhi 

remqti^^ 

remque 

ret 

(rotl) 

(rsb) 

scanc^^ 

skpe’** 

sobgeq^ 

sobgtr^ 

spanc^^ 

subb2⁴

(subb3)

subd2

subd3

subf2

subf3

subg2¹⁰

subg3

subh2¹⁰

subh3¹⁰

subl2

subl3

subp4

subp6¹¹

subw2⁴

(subw3) 

svpctx® 

tstb 

tstd 

tstf

tstg¹⁰

tsth¹⁰

tstl

tstw

xfc® 

xorb2⁴

(xorw3)

(xorb3)

xorl2 

xorl3 

xorw2^ 
Notes


1. The jxxx instructions are pseudo-instructions that are converted by the assembler to
either the corresponding branch instruction or an inverse-sense branch/jump instruction
pair. Correspondingly, the bxxx instructions are generated only by the assembler, with the
exception of the brw instruction, which is also generated directly by the compiler. The
beqlu instruction is identical to the beql instruction (since an unsigned test for equality
is the same as a signed test for equality); the VAX ISA simply provides two mnemonics for
the same instruction. The same is true for bnequ and bneq.

2. These instructions are generated only by the common optimizer pass, not by the
compiler. While this is not bad, it indicates a weakness of the common code generator that
the optimizer must compensate for.

3. This instruction is used exclusively in the conversion from unsigned longword integers
to floating or double variables, not for the intended function of intra-processor
semaphore interlocks.

4. The add, sub, mul, div, bis, bic, and xor instructions for byte and word operands
are produced only in the two-operand form, while the corresponding instructions for
long, floating, and double formats are produced in both two- and three-operand form.

5. The jsb instruction is not generated in any normal code sequence, but only as an
interface mechanism to the run-time profiler. No subroutines are ever called with
anything except the calls linkage.

6. The movc3 instruction is used to copy C structures, not to copy character strings. This
is because the instruction takes as its first operand the number of bytes to be moved,
but the C representation of strings is such that this datum is not readily available.

7. The chmx instructions are intended to be used to switch between processor modes,
and are of questionable utility to a compiler. The chmk instruction is used by the Unix
libraries (written directly in assembly language) to effect kernel calls.

8. These instructions are designed for use in an operating system context and cannot be
expected to be generated by a compiler. Additionally, some of them are privileged
instructions and can only be executed in kernel mode.

9. These are very special case instructions (for use in debuggers and other applications)
that cannot reasonably be generated by a compiler.

10. The g and h floating-point types are not supported by Unix languages and are not
available on all versions of the Vax. Consequently, it is reasonable to allow a portable
compiler to not generate them.

11. The packed decimal instructions are in the Vax ISA primarily for DEC's version of PL/1
(which the Berkeley compilers do not support).

12. The callg instruction is designed for Fortran static call frames, although Berkeley
Fortran does not take advantage of it.

13. The adawi, bbssi, and bbcci instructions are designed for multiprocessor
applications.

14. The polynomial and crc instructions are designed to make assembly language
programming easier. The character manipulation instructions support complex character
comparison, matching, and insertion. All of these instructions provide support for
high-level functions not present in C or Fortran. It would be unreasonable to expect most
compilers to generate these instructions without the corresponding higher level
language primitives.
6.1.3. Comparison of Compiler Coverage

The Mips high-level instruction set contains 192 instructions,^ of which 94 (or almost 49%) are
unused by the compilers. This, however, is a somewhat unfair measure. If we exclude the
instructions that are used for non-existent co-processor functions and the extraneous floating-point
instructions that are present to satisfy the IEEE standard, then of the remaining 134 instructions, only 36 (or
slightly less than 26%) are unused by the compilers. To be still fairer to the Mips compiler, if we count
only those instructions that we considered 'unjustifiably ignored by the compiler', then only 17
instructions (or approximately 12%) are unused.

Because the Mips assembler/reorganizer is really a macro assembler, we must also look at the
coverage of the native instruction set by the compilers, even if this coverage is through one level of
indirection. Of the 135 actual instructions (including the floating-point co-processor instructions^),
50 instructions (or 37%) are unused by the compiler. However, if we exclude superfluous
floating-point and co-processor instructions, then of the remaining 92 instructions, only 7 (a mere 7.5%) are
unused. Of these, only 2 fall under the category of 'unjustifiably ignored' instructions. Clearly, the
Mips M/500 instruction set is sufficiently small to be manageable by the compiler, but sufficiently
large to handle the programming tasks it is designed to handle.

By way of comparison, the Vax native instruction set contains a total of 323 instructions,^ of which
179 (or 55%) are unused by the compiler. This could lead us to believe that either the compiler is
terribly inefficient, or, a more likely conclusion, that the instruction set is far too complex. Even when
we exclude the 'special' instructions for G and H floating-point formats (which are not supported by
all VAX processors), then of the remaining 267 instructions, 121 (or 45%) are unused. To be
absolutely fair, if we count only those instructions which, in the previous analysis, we felt were
'unjustifiably ignored by the compiler', we still find that over 23% of the instruction repertoire of the
VAX is unused by the compiler, and that another 22% are of a complex or systems programming
nature.
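
The percentages quoted in this section are simple ratios of unused instructions to the size of the instruction set under each counting rule. As a rough check, the following sketch recomputes them from the counts given in the text. The dictionary labels and the function name are ours, introduced only for illustration, and the recomputed values can differ by up to about a point from the rounded figures quoted above.

```python
def unused_share(unused, total):
    """Percentage of an instruction set that the compiler never emits."""
    return 100.0 * unused / total


# (unused, total) pairs as stated in the text of this section.
coverage = {
    "Mips high-level, all instructions": (94, 192),
    "Mips high-level, excluding unimplemented co-processor and IEEE-only": (36, 134),
    "Mips native, all instructions": (50, 135),
    "Mips native, excluding superfluous FP and co-processor": (7, 92),
    "Vax native, all instructions": (179, 323),
    "Vax native, excluding G and H floating point": (121, 267),
}

if __name__ == "__main__":
    for label, (unused, total) in coverage.items():
        print(f"{label}: {unused_share(unused, total):.1f}% unused")
```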

By looking at these numbers, it is clear, from a compiler standpoint at least, that a RISC architecture is
better used by a compiler than a CISC architecture. One of the bottlenecks of a CISC computer is
instruction decoding. Removing unneeded instructions can speed up a CISC processor (and at the
same time, convert it to a RISC processor). The later analysis of instruction set coverage shows that
this is a wise move to make, since a large fraction of a "standard" CISC architecture is never used by
the compiler (nor, we suspect, by a human programmer).


^Remember that the number of instructions in the high level assembly language for the Mips does not reflect the number of
instructions found in the Mips M/500 native instruction set. Many of the high level instructions are simply macros that are
expanded by the assembler reorganizer. See chapter 3 for details.

^According to the source code for the disassembler program dis, there is the potential for many more instructions available
on the Mips M/500. However, it is unclear how many of these are actually present in the hardware, and how many were
planned but never inscribed in silicon. We will use as our instruction count the number of instructions that can be created by
the assembler reorganizer, given the set of instructions documented in the 'Assembly Language Programmer's Guide' and
revealed by the translation table in appendix 3.

We note also that a slightly different measurement criterion has been used on the Mips M/500 than on the Vax. On the
Mips M/500, 'add' and 'add immediate' are considered two different instructions, while on the Vax, they are considered to be
one instruction with two different addressing modes. If we follow the Vax metric, the Mips M/500 has 14 fewer instructions, a
figure which better shows off the RISC nature of the architecture.

^This is a count of real instructions and does not include the 21 jbxx pseudo-instructions provided by the Berkeley
assembler.



6.2. Assessment of BCPL/Mips

This  section  contains  a  brief  description  and  assessment  of  the  BCPL/Mips  compiler  created  at  the 
SEI.  It  contains  numbers  specific  to  the  Mips  RISC-based  workstation  and  some  comparisons  with 
the  DEC  MicroVAX  II. 

The  compiler  consists  of  a  front  end  that  translates  BCPL  into  an  intermediate  form  called  Ocode, 
and  a  back  end  that  translates  Ocode  into  symbolic  assembly  language,  which  is  then  assembled  by 
the  target  machine  assembler  program. 

The front end, called bcpl, is common. The back ends for Mips and Vax are called cgmips and
cgvax, respectively. The structure of the two back ends is very similar; cgvax performs a few extra
peephole optimizations, but otherwise the generated code is of similar quality. This allows us to
make a direct comparison between the two machines.

The  vehicle  for  comparison  is  the  cgmips  code  generator  itself,  which  is  a  BCPL  program  with  about 
4400  lines  of  source,  of  which  about  50%  are  white  space  or  comment.  It  is  divided  into  four 
modules  numbered  1  through  4. 

The purpose of this assessment is to:

1.  Obtain  comparative  performance  measurements  in  a  manner  that,  as  far  as  possible, 
reflects  the  hardware  rather  than  the  combination  of  hardware  and  compiler. 

2.  Test  the  claims  that  CISC  architectures  are  too  complicated  and  embody  expensive 
but  unused  features,  whereas  RISC  machines  are  sufficient  for  most  purposes  and 
more  efficient  overall. 

It  is  possible  to  approach  such  a  task  in  two  ways.  One  can  run  very  large  amounts  of  code  through 
the  two  compiling  systems  and  accumulate  statistics  (section  6.3  describes  this  approach).  Or  one 
can  use  a  small  amount  of  code  only  and  try  to  understand  and  explain  the  results.  This  section 
follows  the  latter  course. 

The host systems on which the analysis was performed were:

• DEC MicroVAX II running Mach (4.3 BSD Unix). This is considered to be a machine of
about 0.9 "mips".

• Mips M/500 workstation running 4.3 BSD Unix, with a 16K byte I-cache and an 8K byte
D-cache. This is claimed to be a "4 to 5 mips" machine.
6.2.1. Performance Analysis

We collected the following data for both Vax and Mips:

•  code  size 

•  code  density 

• bcpl execution speed

•  cgmips  execution  speed 

•  assembler  execution  speed. 

These  figures  and  appropriate  totals  and  ratios  are  given  below,  along  with  explanatory  text. 

6.2.1.1. Code Size


Module              1       2       3       4     Total

Bytes            4234    4293    4894    2482     15903

Instructions     1025    1192    1304     642      4163

Bytes/Instr                                        3.80

Table 6-1: Results of cgmips Compiled on the Vax

The  code  size  in  table  6-1  includes  case  statement  jump  tables;  without  them  the  average  bytes  per 
instruction  is  3.64.  Note  that  each  entry  in  such  a  table  occupies  2  bytes  on  the  Vax  but  4  bytes  on 
the  Mips. 


Module              1       2       3       4     Total

Bytes            6420    6112    7480    3756     23768

Instructions     1448    1502    1818     837      5605

Bytes/Instr                                        4.24

Table 6-2: Results of cgmips Compiled on the Mips


The code size in table 6-2 includes case statement jump tables; without them, the average bytes per
instruction is 4.00. However, this figure excludes any code expansion in the assembler reorganizer.
Initial measurements showed this expansion to be considerable (over 40%; see section 3.2);
however, much of this expansion was due to assembler decisions that were not entirely appropriate.
After making changes in the Assembler source, and minor changes in the order in which the code
generator emitted instructions, we were able to reduce the expansion to 25%, and we believe that
further work could reduce it to about 9%. This is discussed below.

6.2.1.2. Code Density

by bytes          1.50  (1.90 after assembly)

by instructions   1.35  (1.70 after assembly)

Table 6-3: Code Expansion Mips / Vax
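
The pre-assembly ratios in table 6-3 follow directly from the per-module figures in tables 6-1 and 6-2. The following sketch recomputes the totals and ratios; the variable names are ours, and the result may differ in the last digit from the printed values because of rounding in the original tables.

```python
# Per-module figures for cgmips, taken from tables 6-1 (Vax) and 6-2 (Mips).
vax_bytes = [4234, 4293, 4894, 2482]
vax_instrs = [1025, 1192, 1304, 642]
mips_bytes = [6420, 6112, 7480, 3756]
mips_instrs = [1448, 1502, 1818, 837]

vax_total_bytes = sum(vax_bytes)
vax_total_instrs = sum(vax_instrs)
mips_total_bytes = sum(mips_bytes)
mips_total_instrs = sum(mips_instrs)

# Average instruction length on each machine (jump tables included).
vax_bytes_per_instr = vax_total_bytes / vax_total_instrs
mips_bytes_per_instr = mips_total_bytes / mips_total_instrs

# Code expansion Mips / Vax, before the assembler reorganizer runs.
expansion_by_bytes = mips_total_bytes / vax_total_bytes
expansion_by_instrs = mips_total_instrs / vax_total_instrs

if __name__ == "__main__":
    print(f"Vax:  {vax_total_bytes} bytes, {vax_total_instrs} instructions, "
          f"{vax_bytes_per_instr:.2f} bytes/instr")
    print(f"Mips: {mips_total_bytes} bytes, {mips_total_instrs} instructions, "
          f"{mips_bytes_per_instr:.2f} bytes/instr")
    print(f"Expansion by bytes: {expansion_by_bytes:.2f}")
    print(f"Expansion by instructions: {expansion_by_instrs:.2f}")
```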

Note that the Vax code density is very high because the code generation strategy uses, wherever
possible, address modes with short offsets. The density of the output of pcc, for example, is
substantially lower.^ On the average, execution of each Vax instruction required about eight cycles.
Execution of each Mips instruction required a little more than one cycle.

6.2.1.3. BCPL Execution Speed

Execution  speed  is  given  in  terms  of  user  process  time  as  measured  by  Unix.  Since  both  machines 
were  workstations  with  a  single  user,  this  correlates  quite  closely  with  elapsed  time. 


cgmips  compiled  from  source  to  Ocode,  on  Vax 


Module 


Time  (sec.) 


This  is  approximately  5000  lines/min. 


cgmips  compiled  from  source  to  Ocode,  on  Mips 


Module 


Time  (sec.) 


This  is  approximately  25000  lines/min.  Overall,  this  program  executes  5.1  times  as  fast  on  Mips. 

6.2.1.4. Cgmips Execution Speed


cgmips  compiled  from  Ocode  to  Assembler,  on  Vax 


Module 


Time  (sec.) 


This  is  approximately  6000  lines/min. 


cgmips  compiled  from  Ocode  to  Assembler,  on  Mips 


Module 


Time  (sec.) 


This  is  approximately  28000  lines/min.  Overall,  this  program  executes  4.5  times  faster  on  Mips. 

6.2.1.5. Combined Execution Speed


cgmips  compiled  from  source  to  Assembler,  on  Vax 


Module 


Time (sec.)


This  is  approximately  2600  lines/min. 


^An average of 5.66 bytes per instruction for our benchmarks.

cgmips compiled from source to Assembler, on Mips

Module  1  2  3  4  Total 


Time  (sec.) 


This is approximately 13000 lines/min. Overall, the compiler and code generator execute 4.8 times
faster on the Mips. For small programs (less than 32k bytes), this is probably an accurate reflection
of the intrinsic speed of the machine.

6.2.1.6. Assembler Execution Speed

This  comparison  is  different  from  the  previous  tests.  We  measured  the  time  taken  to  assemble 
cgmips  on  both  Vax  and  Mips,  using  in  each  case  the  native  Assembler  program.  There  are  several 
points to note:

1. Both programs had to have several bugs fixed, which should not have affected their
speed.

2.  The  two  programs  are  assembling  different  input  files,  and  the  Vax  input  is  about  25% 
smaller.  However,  the  files  contain  functionally  equivalent  programs. 

3.  We  are  measuring  the  combined  effect  of  the  hardware  speed  and  the  software  perfor¬ 
mance. 


cgmips  assembled  from  Assembler  to  Object,  on  Vax 


This  corresponds  to  a  rate  of  assembly  of  approximately  14000  lines/min. 


cgmips  assembled  from  Assembler  to  Object,  on  Mips 

Module  1  2  3  4  Total 


Time  (sec.) 


This corresponds to a rate of assembly of approximately 6700 lines/min. Overall, the Mips
assembler takes three times as long as the Vax assembler for the same program, or more than twice
as long for the same number of instructions. Since it is executing on a machine intrinsically almost
five times faster, this represents a difference in software performance of more than an order of
magnitude.
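
The "order of magnitude" claim can be reproduced from the figures already given. The sketch below is only an illustration under stated assumptions: that the quoted lines/min rates refer to each assembler's own input file, that the Vax input is 25% smaller as noted in point 2, and that the 4.8 speed ratio from section 6.2.1.5 stands in for the intrinsic hardware ratio. The variable names are ours.

```python
# Figures quoted in the text.
vax_rate = 14000.0          # lines/min assembled on the Vax
mips_rate = 6700.0          # lines/min assembled on the Mips
vax_input_fraction = 0.75   # Vax input is about 25% smaller than the Mips input
hardware_speedup = 4.8      # Mips vs. Vax, from the combined execution speed

# Elapsed-time ratio for assembling the same program (Mips time / Vax time),
# with the Mips input length normalized to 1.
time_ratio = (1.0 / mips_rate) / (vax_input_fraction / vax_rate)

# Time ratio per input line, i.e. for the same amount of assembler input.
per_line_ratio = vax_rate / mips_rate

# If the Mips hardware is about 4.8 times faster yet the assembler is slower,
# the software itself must be slower by the product of the two factors.
software_gap = time_ratio * hardware_speedup

if __name__ == "__main__":
    print(f"Mips/Vax assembly time, same program: {time_ratio:.2f}")
    print(f"Mips/Vax assembly time, per line:     {per_line_ratio:.2f}")
    print(f"Implied software performance gap:     {software_gap:.1f}x")
```

Under these assumptions the time ratio comes out a little under three and the implied software gap is above ten, which is consistent with the statement in the text.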

Granted,  the  Mips  Assembler  is  doing  more  -  for  example,  it  is  performing  the  code  reorganization 
required  by  the  target.  Nevertheless,  the  above  data  represent  an  example  of  how  software  can 
degrade  objective  performance  faster  than  hardware  can  enhance  it. 

The full compilation times are:

• cgmips compiled from Source to Object, on Vax: 113.7 sec

• cgmips compiled from Source to Object, on Mips: 69.6 sec

On VAX, the assembler pass takes less than 15% of the time; on Mips, it takes more than 70% of the
time.^

6.2.2.  Instruction  Reorganization  on  Mips 

When the generated Mips code of cgmips was first submitted to the reorganizer, the number of
instructions increased from 5605 to 8116 (by 44.8%). This was a far greater expansion than we had
expected, and reasons were sought.

Our first observation was that the reorganizer was generating the full 32-bit addressing idiom for all
static operands, even though the code generator was following the rules for generating only
gp-relative static data. The error was traced to a bug in the assembler; when this was fixed, the
number of lui instructions generated was reduced from 1086 to 136.

A further reduction could be made by observing that the assembler, though now generating single
instructions for loads and stores of most static data, still accessed local static data using two
instructions. This was traced to an interesting feature: the assembler correctly handled gp-relative
addresses only for operands declared before the operation referencing them (see section 9). Making
this fix replaced 125 lui/addi pairs with 125 simple li instructions.

This left the reorganized code count at 7041, an expansion of 1435 instructions (25.6%). These
extra instructions exhibit the pattern shown in figure 6-1:


[Pie chart of extra instructions by type: nop 84.3%, move 8.0%, compute 2.0%, control 5.7%]


Figure 6-1: Extra Instructions - Pattern of Use

However, even these figures are too high. For reasons explained in section 3.2, the reorganizer
must make very pessimistic assumptions about whether it is safe to rearrange load and store
instructions. Accordingly, it often generates a nop to fill a load delay, when in fact another instruction could
be placed there. It also has some problems filling branch delays.


^According to Larry Weber at Mips, the reason that this pass takes so long is the assembler front end. With Mips
compilers, this stage is bypassed altogether, with the compiler back ends calling the assembler middle end directly. If BCPL
took this approach, these times would be appreciably reduced.

We did not perform a full analysis, but a study of a sample of the 1209 nop instructions generated
suggests that about 70% could be removed by a reorganizer that had more information about
aliasing and block structure, reducing the count to about 360.

Of  the  226  other  extra  instructions,  several  could  be  removed  by  combining  reorganization  with  code 
generation.  For  example,  the  reorganizer  always  loads  a  constant  operand  of  a  conditional  branch 
into  register  at;  the  code  generator  could  track  this  value  and/or  use  more  than  one  temporary 
register.  A  sampling  suggests  that  about  40%  of  these  instructions  could  be  removed,  leaving  about 
135. 

Taken  together,  all  these  changes  would  reduce  the  reorganization  penalty  to  about  500  instructions, 
or  a  little  less  than  9%. 
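
The figures in this section chain together arithmetically. As a rough check, the sketch below starts from the counts quoted in the text and reproduces the instruction totals and the projected penalty. The variable names are ours, and the one-instruction saving per lui/addi pair is our reading of the text (each two-instruction pair becomes a single li).

```python
generated = 5605            # instructions emitted by the code generator
first_reorganized = 8116    # after the first run through the reorganizer

# Fix 1: gp-relative addressing bug, lui count drops from 1086 to 136.
lui_saved = 1086 - 136
# Fix 2: 125 lui/addi pairs become 125 single li instructions,
# saving one instruction per pair (assumption stated above).
pair_saved = 125

after_fixes = first_reorganized - lui_saved - pair_saved

initial_expansion_pct = 100.0 * (first_reorganized - generated) / generated

# Remaining extra instructions, as classified in the text.
nops = 1209
other_extra = 226

# Estimated removable fractions from the sampling described above.
remaining_nops = nops * (1 - 0.70)
remaining_other = other_extra * (1 - 0.40)

projected_penalty = remaining_nops + remaining_other
projected_pct = 100.0 * projected_penalty / generated

if __name__ == "__main__":
    print(f"Initial expansion: {initial_expansion_pct:.1f}%")
    print(f"Count after assembler fixes: {after_fixes}")
    print(f"Projected penalty: about {projected_penalty:.0f} instructions "
          f"({projected_pct:.1f}%)")
```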

6.2.3.  Instruction  Set  Usage  -  Mips 

The usage pattern for Mips instructions, address modes, and registers is given in the following
tables. These figures are for code compiled from a simple systems implementation language.
Superscripted numbers refer to notes at the end of these tables.
Instruction     Count    Percent

add               125      1.8%
addi              536      7.6%
addiu               9      0.1%
addu                7      0.1%
sub                69      1.0%
multu               5      0.1%
sll                84      1.5%²
div                 7      0.1%
mflo               10      0.2%³
mfhi                2      0.0%
Arithmetic        854     12.1%

nor                 8      0.1%
and                 5      0.1%
andi               10      0.2%
or                  1      0.0%
ori                 1      0.0%
xor                15      0.2%
sllv                3      0.1%²
srl                 6      0.1%
srlv                2      0.0%
Logical            51      0.7%

slt                 8      0.1%
sltiu              11      0.2%
Boolean            19      0.3%⁴

Compute           924     13.1%

lbu                 6      0.1%
lw               1799     25.6%
li                415      5.9%
lui                11      0.2%
Load             2231     31.7%

sb                  9      0.1%
sw                902     12.8%
Store             911     12.9%

move              262      3.7%¹

Move             3404     48.3%

slt                49      0.7%
slti               33      0.5%
beq               184      2.6%
bne               201      2.9%
bltz               31      0.4%
blez                2      0.0%
bgtz                0      0%
bgez               14      0.2%
Cbranch           514      7.3%

b                 320      4.5%
jr                130      1.8%⁵
Ubranch           450      6.4%

bgezal              1      0.0%
jalr              539      7.7%
Call              540      7.7%⁶

Control          1504     21.4%

Noop             1209     17.2%

Total            7041      100%

Table 6-4: Instruction Counts - Mips


Notes 

1 .  A  move  from  one  register  to  another  is  suspect,  since  it  might  be  due  to  inadequate 
targeting.  These  instructions  were  checked  by  hand,  and  38  were  found  to  be  remov¬ 
able  by  arbitrarily  better  code  generation  (14%  of  the  moves  or  0.5%  of  the  code).  The 
remainder  were  genuine. 

2.  Most  left  shifts  are  optimizations  of  multiplication  and  are  counted  as  arithmetic;  only  a 
few  are  true  logical  shifts. 

3. Every mul or div must be followed by an mflo or mfhi to collect the results of the
product, quotient, or remainder.

4. This is a false picture. In several places, the source code uses a conditional statement
returning  true  or  false  instead  of  a  pure  Boolean  expression.  Hand  checking  shows 
that  there  should  be  about  three  times  as  many  uses  of  these  instructions  (about  1% 
overall). 

5.  This  is  121  procedure  returns,  2  true  jumps,  and  7  case  statements  implemented  as 
jump  tables.  Each  procedure  has  exactly  one  return  jump:  in  order  to  improve  com¬ 
parability  with  cgvax,  we  inhibited  code  hoisting. 

6.  This  is  540  calls  in  121  procedures. 

If  the  nop  instructions  are  excluded,  the  instruction  mix  is  as  shown  in  figure  6-2: 


Figure 6-2: BCPL/Mips

This  figure  shows  a  fairly  typical  pattern  for  a  load/store  machine.  The  approximate  breakdown  of 
60%  load/store,  15%  compute,  and  25%  control  is  not  unfamiliar.  However,  it  underlines  the  need 
for  good  data  caching  and  wide  bandwidth  to  memory.  The  high  proportion  of  control  instructions  is 
typical  of  systems  code;  it  shows  that  a  good  branch  cache  or  instruction  cache  is  desirable. 
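
The breakdown without nops can be recomputed from the group totals of Table 6-4. The following is only an illustrative sketch; the dictionary layout and variable names are ours, not part of the original analysis.

```python
# Group totals taken from Table 6-4 (Instruction Counts - Mips).
counts = {"move": 3404, "compute": 924, "control": 1504, "nop": 1209}

# Drop the nops and express each remaining group as a share of what is left.
useful = {name: n for name, n in counts.items() if name != "nop"}
useful_total = sum(useful.values())
mix = {name: round(100 * n / useful_total, 1) for name, n in useful.items()}

print(useful_total)
print(mix)
```

The move group in this table includes both loads and stores, which is why it corresponds to the "load/store" share quoted in the text.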

The  proportion  of  stores  to  loads  is  about  2  :  5,  and  stores  are  about  27%  of  all  moves.  This  is  a  little 
higher  than  one  would  expect;  the  reason  is  that  register  tracking  is  eliminating  a  lot  of  loads.  In 
detail,  we  save: 

•  570  loads  of  0,  +1,  -1

•  2  loads  of  other  values 

•  425  loads  from  memory 

for  a  total  of  997.  This  is  a  saving  of  over  30%. 
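
These ratios follow directly from the counts in Table 6-4 and the list above. The sketch below retraces them; it assumes (our reading, not stated in the report) that the saving is measured against the loads that would have been emitted without register tracking.

```python
# Load, store, and move totals from Table 6-4.
loads, stores, all_moves = 2231, 911, 3404

# Loads eliminated by register tracking, as itemized in the text.
saved_loads = 570 + 2 + 425

store_to_load = stores / loads              # roughly 2 : 5
store_share = 100 * stores / all_moves      # stores as a share of all moves

# Assumption: the baseline is the number of loads without tracking.
saving_percent = 100 * saved_loads / (loads + saved_loads)

print(saved_loads)
print(f"{store_to_load:.2f} {store_share:.1f}% {saving_percent:.1f}%")
```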

The  very  small  number  of  byte  loads  and  stores  seems  surprising,  since  the  program  being  analyzed 
reads  and  writes  text  files.  However,  almost  all  the  code  treats  strings  as  atomic  objects  passed  by 
reference;  only  in  a  few  primitive  routines  are  the  individual  characters  accessed. 

Overall,  the  only  instructions  that  seem  underused  are  the  Boolean  seq  group.  As  noted,  this  is 
partly  an  artificial  result;  but  at  best  they  would  be  used  only  1%  of  the  time.  However,  in  the  true 
Mips  architecture,  they  are  used  to  implement  some  of  the  branch  macro-instructions,  so  they  prob¬ 
ably  come  for  free.  Moreover,  if  they  were  absent,  the  code  sequence  that  the  compiler  would  have 
to  generate  would  be  quite  expensive. 


Figure 6-2 legend (M/500 Instruction Mix): move 58.4%, compute 15.8%, control 25.8%

The nor instruction was never generated by cgmips even though a specific optimization was added
to  look  for  a  chance  to  use  it.  It  appears  in  the  reorganized  code  only  as  the  translation  of  not. 


Address Mode Usage

Conceptual                       Physical

Constant     2689   21.2%        Immediate    2119   16.7%
Local        1461   11.5%¹       Absolute        0    0.0%⁴
Protocol      865    6.8%²       Register     7822   61.8%
Static       1105    8.7%        Based        2716   21.5%⁶
Indirect      297    2.3%³
Temporary    6240   49.3%
Total       10166    100%        Total       10166    100%

Table 6-5: Address Mode Usage - Mips

In addition, there were 753 branch targets.

Offset and Constant Sizes

Constant, 0                  358
Constant, +1                 152
Constant, -1                  60⁵
Immediate, 16 bits          2115
Immediate, 32 bits             4
All Numbers                 2689

Stack-based, 16 bits        1467
Stack-based, 32 bits           0
Pointer-based, no offset     120
Pointer-based, 16 bits       177
Pointer-based, 32 bits         0

Table 6-6: Offset and Constant Sizes - Mips

Notes 

1. The distribution between value moves and address loads is:

             Value   Address   Total
   Local      1455         6    1461
   Static      968       137    1105

2.  This  refers  to  operations  implementing  the  procedure  entry/exit  protocol. 

3.  This  includes  all  pointer,  structure,  and  array  references. 

4.  The  Absolute  address  mode  cannot  be  generated  by  this  language. 

5.  Recall  that  -1  is  the  BCPL  representation  of  true. 

6.  Based  address  mode  is  of  the  form  displacement  (register) . 


This  data  is  a  compelling  vindication  of  the  RISC  design.  The  machine  has  just  three  data  address 
modes, and they are used in the ratio 17% : 62% : 21%. In fact, register tracking, and the use of
three registers to hold constants, inflates the middle figure - for unoptimized code the pattern would
be closer to 21% : 53% : 26%. Note that 50% register operands is the minimum possible, since a
simple  assignment  generates  two  memory  references  and  two  register  references,  while  any  more 
complicated  expression  generates  more  register  references  than  memory  references. 


The  li  instruction  adds  a  16-bit  immediate  value  to  the  contents  of  a  register  and  loads  the  result. 
This  allows  it  to  serve  as  a  load  address  instruction,  and  it  appears  in  the  high  level  instruction  set  as 
la.  This  is  clearly  a  good  idea:  over  6%  of  references  to  based  addresses  use  this  idiom. 

Inspection  of  the  generated  code  shows  just  two  places  where  some  further  economy  could  be 
achieved; 

1. The only way to access static tables with a short address mode was to put them in the
.sdata segment, even though they were conceptually read-only. This could be
ameliorated  by  having  a  PC-relative  address  mode,  or  by  allowing  the  user  to  set  up 
global  base  registers. 

2.  The  majority  of  the  conditional  branches  were  of  the  form  "compare  register  and  small 
constant".  A  true  Mips  instruction  that  implemented 

     <conditional-branch> <register> <immed> <destination>

would  be  very  useful,  though  we  agree  it  would  be  hard  to  fit  into  32  bits.  With  the 
present  machine,  this  expands  into  two  true  instructions  (8  bytes):  by  contrast,  the 
same  idiom  on  the  Vax  usually  takes  5  bytes. 


The  pattern  found  for  immediate  values  and  offsets  confirms  that  a  mode  with  a  32-bit  offset  is 
unnecessary. However, it would be helpful if the Mips assembler allowed the programmer to use
general  registers  as  global  base  registers,  instead  of  keeping  this  ability  strictly  to  itself  (and  not 
using  it  to  best  advantage). 


6.2.4.  Register  Usage  -  Mips 

The Ocode code generator uses u0 through u13 as accumulators and for parameter passing. Up to
14 parameters are passed in registers; any additional parameters are passed on the stack. Results
are returned in u0. The same registers are used in round-robin fashion for temporaries, starting with
u0 for the first temporary of each basic block. All registers are tracked across linear code and
non-looping control structures. All accumulators are assumed destroyed by a procedure call. (This
is not the protocol of the other Mips compilers.)

The registers rz, ru, and rm always hold the values 0, +1, and -1, respectively. Register rp is the
Ocode stack pointer, rl is the return link register, and rw is a work register.

Register Usage

Accumulators        Special Registers

u0     1778         rz      358   (holds 0)
u1      912         ru      152   (holds +1)
u2      542         rm       60   (holds -1 or TRUE)
u3      366         rp     2409   (stack pointer)
u4      250         rw     1078   (temporary)
u5      157         rl      363   (return link)
u6      120         gp     1096   (Mips sdata base register)
u7       79
u8       66
u9       50
u10      38
u11      24
u12      21
u13      15
Total  4418         Total  5516

Table 6-7: Register Usage - Mips


The accumulator pattern is, as expected, very close to a negative-binomial distribution. It illustrates
well  the  way  benefits  rapidly  diminish  with  this  allocation  strategy.  Interprocedural  register  allocation 
would  be  a  better  (but  harder)  strategy. 


There are 2409 references to the Ocode stack pointer, rp. Of these, 1461 are accesses to local
variables,  and  942  are  generated  by  471  instructions  to  raise  or  lower  the  stack.  The  stack  is  moved 
by  the  caller  before  and  after  every  procedure  call;  canonically  that  would  be  2*540  =  1080  moves, 
but  optimizations  remove  609  of  them  (56.8%),  giving  a  much  faster  protocol  than  the  conventional 
one  in  which  the  called  procedure  moves  the  stack. 


The  temporary  register  rw  is  used  during  a  procedure  call.  This  is  necessary  because  BCPL  calls 
procedures  indirectly  through  a  transfer  vector,  so  the  call  sequence  is: 

     lw     rw, procoffset(vector)
     jalr   rw


giving  2*539  =  1078  uses  of  rw  for  539  calls  of  external  procedures. 

6.2.5. Instruction Set Usage - VAX

Here are the same data for the Vax, using the same source program, but compiled by bcpl and
cgvax.  It  is  much  harder  to  understand  these  tables,  since  there  are  many  special  idioms  that 
perform actions that are not obvious. For example, a constant can be loaded into a register by any
of the following:

     clrl     r0
     movl     #1,r0
     mcoml    #63,r0
     movzbl   #200,r0
     cvtbl    #-100,r0
     movzwl   #300,r0
     cvtwl    #-200,r0

and  a  constant  can  be  added  to  a  register  by: 

     incl     r0
     decl     r0
     addl2    #2,r0
     subl2    #2,r0
     moval    100(r0),r0

Although there are other methods for achieving these results, the reader is assured that each
example is indeed the shortest way to accomplish that operation with that specific constant.
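
To illustrate how a code generator might choose among the constant-loading idioms, here is a minimal sketch. The function name and the exact range boundaries are our assumptions, inferred from the seven examples above together with the 6-bit short literal and the 8- and 16-bit immediate widths of the Vax. It is not the selection logic of cgvax itself.

```python
def load_constant(n: int, reg: str = "r0") -> str:
    """Pick a short Vax idiom for loading the constant n (hypothetical helper)."""
    if n == 0:
        return f"clrl {reg}"                   # no operand needed for zero
    if 1 <= n <= 63:
        return f"movl #{n},{reg}"              # fits the 6-bit short literal
    if -64 <= n <= -1:
        return f"mcoml #{-n - 1},{reg}"        # complement of a short literal
    if 64 <= n <= 255:
        return f"movzbl #{n},{reg}"            # zero-extended byte immediate
    if -128 <= n <= -65:
        return f"cvtbl #{n},{reg}"             # sign-extended byte immediate
    if 256 <= n <= 65535:
        return f"movzwl #{n},{reg}"            # zero-extended word immediate
    if -32768 <= n <= -129:
        return f"cvtwl #{n},{reg}"             # sign-extended word immediate
    return f"movl #{n},{reg}"                  # full 32-bit immediate


for value in (0, 1, -64, 200, -100, 300, -200):
    print(load_constant(value))
```

For the seven constants in the loop this reproduces the seven instructions listed above, which is the sense in which each idiom is tied to "that specific constant".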

The Ocode code generator uses jsb exclusively for calls, and builds its own stack on r12. There
are therefore no occurrences of callx, pushx, ret, or rsb.

Instruction     Count   Percent

movb                1     0.0%
clrb                1     0.0%
clrl              137     3.3%
clrq                1     0.0%
movl             1194    28.7%
movq              107     2.6%
mcoml              58     1.4%¹
cvtbl               0     0.0%
cvtwl               0     0.0%
cvtlb               7     0.2%
movzbl             21     0.5%²
movzwl              5     0.1%
moval             132     3.1%³
Move             1664    40.0%

mnegl               9     0.2%
incl               32     0.8%
addl              300     7.2%
decl               28     0.7%
subl              212     5.1%
moval              28     0.7%³
tstl               44     1.1%⁴
mull               21     0.5%
divl                5     0.1%
emul                2     0.0%
ediv                2     0.0%
Arithmetic        683    16.4%

mcoml               8     0.2%¹
bisl                1     0.0%
bicl               14     0.3%
xorl                4     0.1%
rotl                0     0.0%
ashl                3     0.1%
ashq                1     0.0%
extv                0     0.0%
extzv               7     0.2%
insv                0     0.0%
Logical            38     0.9%

Compute           721    17.3%

tstl              105     2.5%⁴
cmpl              266     6.4%
bitl                0     0.0%
cmpv                0     0.0%
cmpzv               0     0.0%
Compare           371     8.9%

beql               99     2.4%
bneq              182     4.4%
blss               38     0.9%
bgtr               28     0.7%
bleq               32     0.8%
bgeq               24     0.6%
blssu               0     0.0%⁵
bgtru               0     0.0%
blequ               0     0.0%
bgequ               0     0.0%
casel              12     0.3%⁶
Cbranch           415    10.0%

brb               257     6.2%
brw                74     1.8%
jmp               121     2.9%⁷
Ubranch           452    10.9%

bsbb                0     0.0%
bsbw                1     0.0%
jsb               539    13.0%
Call              540    13.0%⁸

Control          1778    42.7%

Total            4163     100%

Table 6-8: Instruction Usage - VAX


Notes 

1. Most uses of mcoml are to load a small negative number, and these are considered
moves. A few are genuine bitwise complement operations, and so are considered
logical.

2. Most uses of movzbl, and all uses of movzwl, are to load medium-sized constants.

3. Some uses of moval were to add a constant to a register, and these are considered
adds. The remainder are genuine moves of addresses.

4. The idiom tstl (r12)+ is sometimes used to add 4 to the Ocode stack pointer.
These are considered additions; the other occurrences of tstl are true tests.

5.  This  language  cannot  generate  unsigned  comparisons. 

6.  That  is,  one  for  each  case  statement  implemented  as  a  jump  table.  The  code  gener¬ 
ator  algorithm  for  choosing  between  a  table  and  a  sequence  of  tests  depends  on  the 
number  of  cases  and  their  sparsity.  Since  the  Vax  form  of  the  jump  table  is  half  the 
size  of  the  Mips  form,  this  algorithm  chooses  jump  tables  more  often  on  the  Vax. 

7.  The  jmp  instruction  is  used  only  to  implement  a  procedure  return.  This  version  of 
cgvax  did  not  support  any  code  hoisting;  there  are  therefore  121  returns  in  121  proce¬ 
dures. 

8.  This  is  540  calls  in  121  procedures. 

This  is  a  different  pattern  from  that  found  on  the  Mips.  The  main  differences  of  interest  are  dis¬ 
cussed  in  the  next  section. 


Address Mode Usage

Conceptual                       Physical

Constant     1380   22.5%¹       Literal            982   16.4%
Local        1181   19.2%²       Indexed             68    1.1%
Protocol      363    6.0%        Register          1946   32.5%
Static       1094   17.8%³       Reg deferred        47    0.8%⁵
Indirect      245    4.0%        Auto Decrement     124    2.1%⁶
Indexed        68    1.1%⁴       Auto Increment     143    2.4%⁶
Temporary    1808   29.4%        Autoinc deferred   121    2.0%⁶
                                 Displacement      1798   30.0%
                                 Disp deferred      539    9.0%⁷
                                 Immediate           93    1.6%
                                 Absolute             0
                                 Relative           126    2.1%
                                 Rel deferred         0
Total        6139    100%        Total             5987    100%

Table 6-9: Address Mode Usage - Vax

In addition, there are 720 branch addresses.

Offset and Constant Sizes

Constant, 0 (clr/tst)        245
Constant, +1 (inc/dec)        60⁸
Literal, 6 bits              982
Immediate, 8 bits             84
Immediate, 16 bits             5
Immediate, 32 bits             4

Stack-based, 8 bits         1146
Stack-based, 16 bits          35
Stack-based, 32 bits           0
Pointer-based, no offset      47
Pointer-based, 8 bits        175
Pointer-based, 16 bits        23
Pointer-based, 32 bits         0

Table 6-10: Offset and Constant Sizes - Vax

Notes 

1. Of these, 245 zeros are elided into clr or tst instructions, and 60 occurrences of
unity are elided into inc or dec, leaving 1075 explicit constant operands.

2.  Of  these,  1172  are  to  load  or  store  plain  values,  6  are  to  generate  the  address  of  local 
variables,  and  3  are  indirect  accesses  through  a  local  pointer. 

3.  Of  these,  422  are  plain  loads  and  stores,  126  are  to  generate  addresses,  and  536  are 
indirect  accesses  through  a  pointer. 

4.  An  indexed  operand  also  has  a  base  address  that  is  a  second  logical  operand.  The 
base  operands  are  distributed  thus: 

•  Temporary  -  2 

•  Static  pointer  -  35 

•  Local  pointer  -  31 . 

5.  Plus  2  that  are  base  addresses  of  index  mode. 

6.  These  modes  are  never  generated  for  true  operand  access.  They  occur  only  as  part  of 
the  procedure  entry/exit  protocol  and  as  special  idioms. 

7.  Plus  66  that  are  base  addresses  of  Index  mode. 

8. It is advantageous on the Vax to avoid small negative numbers, e.g.:

     addl #-1,r0  ->  subl #1,r0
     movl #-1,x   ->  mcoml #0,x

Hence, the constant -1 rarely occurs in the generated code.




These  statistics  are  very  confusing.  However,  two  things  seem  clear.  First,  the  8-bit  offset  mode, 
and  the  6-  and  8-bit  literal  modes,  amply  justify  themselves.  They  account  for  96%  of  all  offsets  and 
99%  of  all  literals. 

Secondly,  the  majority  of  the  address  modes  are  hardly  ever  used.  If  we  exclude  the  modes  gener¬ 
ated  only  by  hand-crafted  protocol  sequences,  then  just  three  modes  -  literal,  register,  and  displace¬ 
ment  -  account  for  almost  80%  of  all  operands.  The  most  significant  remaining  mode  -  displace¬ 
ment  deferred  -  is  generated  only  by  the  BCPL  calling  sequence  for  external  procedures. 
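
The percentages quoted in the last two paragraphs can be retraced from Tables 6-9 and 6-10. The grouping below is our reading of which rows are meant (offsets with no displacement are left out, and the three-mode share is taken against the full physical total); the report does not spell this out.

```python
# Offset sizes from Table 6-10 (entries with no offset are excluded here).
offsets = {"stack 8-bit": 1146, "stack 16-bit": 35,
           "pointer 8-bit": 175, "pointer 16-bit": 23}
# Explicit constant operands from Table 6-10.
literals = {"6-bit": 982, "8-bit": 84, "16-bit": 5, "32-bit": 4}

short_offsets = offsets["stack 8-bit"] + offsets["pointer 8-bit"]
offset_share = 100 * short_offsets / sum(offsets.values())

short_literals = literals["6-bit"] + literals["8-bit"]
literal_share = 100 * short_literals / sum(literals.values())

# Physical address modes from Table 6-9: literal, register, displacement.
physical_total = 5987
three_mode_share = 100 * (982 + 1946 + 1798) / physical_total

print(f"{offset_share:.1f}% {literal_share:.1f}% {three_mode_share:.1f}%")
```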

6.2.6.  Register  Usage  -  Vax 

The  Ocode  code  generator  uses  rO  through  r7  as  accumulators  and  for  parameter  passing.  Up  to  8 
parameters  are  passed  in  registers;  any  additional  parameters  are  passed  on  the  stack.  Results  are 
returned  in  rO.  The  same  registers  are  used  in  round-robin  fashion  for  temporaries,  with  the  exact 
same  conventions  as  on  the  Mips.  The  register  pair  <r7,  r8>  is  used  as  a  special  accumulator  for 
instructions  that  require  two  registers,  such  as  ashq  or  ediv. 

Register r12 is the Ocode stack pointer, used to address local variables. Register r10 is the static
database pointer, which has much the same purpose as gp on the Mips. The hardware stack
pointed to by sp is never used, but the return link must be popped off it on entry to every procedure;
hence there are 121 references.


Register Usage

Accumulators        Special Registers

r0     1187         r10     993
r1      369         r12    1941
r2      141         sp      121
r3       56
r4       29
r5       15
r6        0
r7        8
r8        3
Total  1808         Total  3055

Table 6-11: Register Usage - Vax


Once  again,  the  pattern  of  accumulator  usage  is  typical.  Vax  code  generation  seems  to  require  far 
fewer  registers  than  Mips  code  generation,  but  this  is  largely  a  figment  of  the  round-robin  strategy 
which  tries  to  avoid  reusing  registers  when  fresh  ones  are  available.  Further  analysis  shows  that  6 
registers  on  Vax,  or  8  on  Mips,  would  be  enough  to  allow  both  good  register  tracking  and  efficient 
expression evaluation. Note, however, that the code generator does not bind local variables to
registers. 




There are 1941 references to the Ocode stack pointer, r12. Of these, 1181 are references to local
variables, 242 are part of the procedure entry/exit protocol, and the rest are generated by 475
instructions to move the stack. The strategy on the Vax is the same as on Mips: the caller moves the
stack,  and  of  the  canonical  1080  moves  required  by  540  calls,  the  code  generator  can  remove  605 
(56.2%).  This  optimization  leaves  the  stack  pointer  biased  between  two  successive  procedure  calls; 
access  to  local  variables  then  uses  a  negative  offset  from  it.  The  possibility  of  this  optimization  is 
one  reason  for  preferring  offsets  from  a  base  register  to  be  signed. 

6.2.7.  Architectural  Comparison 
6.2.7.1. Move Versus Load/Store

The VAX instruction breakdown shows far fewer move instructions. This is, of course, because the
Mips is a load/store machine, whereas the Vax may be used as a multi-address machine. Thus, a
simple  copy 
a  :=  b 

is  two  instructions  on  Mips 

     lw     reg,b
     sw     reg,a

but one instruction on the Vax

     movl   b,a

And  a  simple  addition 
a  :=  a+b 

is  four  instructions  on  Mips 

     lw     r1,a
     lw     r2,b
     add    r2,r2,r1
     sw     r2,a

but  again  only  one  instruction  on  the  Vax 
addl2  b,a 

The  effect  of  this  is  to  inflate  the  number  of  moves.  However,  this  effect  is  mitigated  by  optimization. 
If  the  value  in  A  is  to  be  used  again,  it  is  often  better  to  compute  that  value  in  a  register: 

     addl3  a,b,r0
     movl   r0,a

A sampling of the code shows that about one-half of the 'extra' Mips moves were used to load the
right operand of an operation, confirming the popular view that a general-register one-address
organization is the best compromise between instruction density and simplicity. A load/store machine
generates more instructions but can perform better overall because, since the fetch of an operand is
not tightly coupled to its use, the fetch delay can be overlapped with useful work.

6.2.7.2. Three-Address Idiom

Another feature of the Vax is the "three-address" instructions that allow one, for example, to translate

     a := b + c

as

     addl3  b,c,a

These  are  available  for  most  dyadic  operations,  and  their  pattern  of  use  is  shown  in  table  6-12. 


Instruction     3-address     Total

addl                   38       300
subl                   38       212
mull                    8        21
divl                    5         5
bisl                    0         1
bicl                   11        14
xorl                    2         4
Total         102 (18.3%)       557

Table 6-12: Three Address Mode Usage

The  code  generator  tries  very  hard  to  generate  the  three-address  form  to  save  register  traffic.  How¬ 
ever,  on  the  basis  of  these  figures,  it  is  barely  worth  having:  it  saved  about  6%  of  the  move  instruc¬ 
tions. 
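
The two percentages can be checked against Table 6-12 and the move total of Table 6-8. The sketch assumes that each three-address instruction saves exactly one move and that the saving is expressed relative to the 1664 Vax move instructions; both are our interpretation of the text.

```python
# (three-address uses, total uses) per instruction, from Table 6-12.
usage = {
    "addl": (38, 300), "subl": (38, 212), "mull": (8, 21), "divl": (5, 5),
    "bisl": (0, 1), "bicl": (11, 14), "xorl": (2, 4),
}
vax_moves = 1664  # Move subtotal from Table 6-8

three_address = sum(three for three, _ in usage.values())
all_dyadic = sum(total for _, total in usage.values())

share = 100 * three_address / all_dyadic
# Assumption: one move avoided per three-address instruction.
moves_saved = 100 * three_address / vax_moves

print(three_address, all_dyadic, f"{share:.1f}%", f"{moves_saved:.1f}%")
```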

6.2.7.3.  Condition  Codes  and  Branches 

The  pattern  of  conditional  branches  is  slightly  different  between  Mips  and  Vax.  This  is  because 
cgvax  looks  for  idioms  such  as: 

•  X >= 1  ->  X > 0  (saves 1 byte)

•  X >= 64  ->  X > 63  (saves 4 bytes)

There  are  fewer  branches  overall  because  the  Vax  code  implements  more  case  statements  as  jump 
tables,  and  because  the  Vax  case  instruction  includes  a  range  check. 

However,  the  Vax  code  has  a  far  higher  proportion  of  control  instructions.  One  reason  is  that  there 
are  fewer  instructions  overall,  so  the  same  number  of  control  transfers  is  a  larger  proportion.  But 
there  are  also  absolutely  more  such  instructions:  1778  versus  1504. 

This difference is almost entirely because of the tstl and cmpl instructions. The code generator
tracks the condition codes religiously, both through linear code and across control transfers.
Nevertheless, of 415 conditional branches, 371 (almost 90%) required a prior test or compare to set the
condition codes; of 2169 normal instructions that set the condition codes, only 44 (about 2%) did so
to any purpose. It is hard to avoid the conclusion that condition codes are a waste of time, effort,
and silicon.
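
A few lines of arithmetic show where these two percentages come from. The variable names are ours; the counts are those quoted in the paragraph above and in Table 6-8.

```python
conditional_branches = 415      # Cbranch subtotal, Table 6-8
explicit_compares = 371         # Compare subtotal (tstl, cmpl), Table 6-8
cc_setting_instructions = 2169  # normal instructions that set condition codes

# Branches that could reuse condition codes set by an earlier normal instruction.
reused = conditional_branches - explicit_compares

needs_compare = 100 * explicit_compares / conditional_branches
useful_settings = 100 * reused / cc_setting_instructions

print(reused, f"{needs_compare:.1f}%", f"{useful_settings:.1f}%")
```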

The Mips machine is not perfect, however. First, because a full set of conditional branches is not
available,  82  "set"  instructions  had  to  be  generated  to  prepare  for  432  branches.  There  is  another, 
more  difficult,  problem.  The  most  common  kind  of  test  in  the  program  being  analyzed  is: 

     IF <component-of-structure> = <small-constant> THEN...

where  the  small  constant  represents  a  value  of  a  scalar  type.  If  we  assume  that  a  pointer  to  the 
structure  is  already  in  a  register,  then  the  Vax  code  looks  like: 

     cmpl   offset(r1),#constant
     bneq   else

which  is  2  instructions  and  6  bytes.  The  Mips  code  looks  like: 

     lw     u2,offset(u1)
     li     at,constant
     bne    u2,at,else

which  is  3  instructions  and  12  bytes.  The  first  instruction  is  a  consequence  of  the  load/store  architec¬ 
ture,  and  can  sometimes  be  optimized  out.  The  second  instruction  is  necessary  because  the  branch 
operations  do  not  take  an  immediate  operand.  Note  also  that  it  might  be  necessary  to  append  a 
no-op  after  the  branch. 
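
The byte counts for the two sequences can be reproduced with a small size model. This is a sketch under stated assumptions: one-byte Vax opcodes, an 8-bit displacement, a 6-bit short literal, and an 8-bit branch displacement, against fixed 4-byte Mips instructions. The table and function names are hypothetical.

```python
# Assumed Vax operand specifier sizes in bytes (short forms only).
VAX_OPERAND_BYTES = {
    "disp8": 2,      # mode/register byte plus 8-bit displacement
    "literal6": 1,   # 6-bit short literal packed into the specifier byte
    "register": 1,   # mode/register byte
    "branch8": 1,    # 8-bit branch displacement
}


def vax_size(operands):
    """One opcode byte plus the size of each operand specifier."""
    return 1 + sum(VAX_OPERAND_BYTES[op] for op in operands)


# cmpl offset(r1),#constant ; bneq else
vax_bytes = vax_size(["disp8", "literal6"]) + vax_size(["branch8"])

# lw ; li ; bne, each a fixed 32-bit instruction (delay-slot nop not counted)
mips_bytes = 3 * 4

print(vax_bytes, mips_bytes)
```

A nop in the branch delay slot, when needed, would add a further 4 bytes to the Mips sequence.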

The  Mips  instructions  have  room  for  a  16-bit  relative  branch,  or  a  16-bit  immediate  operand,  but  not 
both.  The  VAX  code  is  much  denser,  in  part  because  it  uses  an  8-bit  field  for  both  the  relative  branch 
and the immediate operand; for the Mips to achieve a comparable density it would have to either abandon the
fixed  32-bit  instruction  format  or  use  smaller  field  sizes  in  this  special  case. 

It  seems  that  smaller  field  sizes  would  improve  code  density:  over  95%  of  constants  would  fit,  and 
over  90%  of  branch  destinations  (in  fact,  12%  of  the  branches  on  the  Vax  are  not  within  8-bit  range, 
but  that  is  with  a  relative  byte  address;  Mips  uses  a  relative  word  address).  These  branch  instruc¬ 
tions  are  perhaps  the  part  of  the  Mips  order  code  that  suffers  most  from  the  simplification  of  the 
RISC  design. 


6.2.7.4. Index Mode

The  number  of  adds  and  left  shifts  is  much  lower  on  the  Vax  because  of  the  scaled-index  mode.  For 
a  simple  language,  where  nearly  all  arrays  have  word-sized  components,  this  mode  can  be  used  for 
most array references. For example, to load the value of ary[i] into a register:

VAX:
     movl   i,rx
     movl   ary[rx],r0

Mips:
     lw     rx,i
     sll    rx,rx,2
     add    rx,rx,ary
     lw     u0,0(rx)

There are in fact 68 occurrences of this mode in a total of 3930 operand references (1.7%).

The existence of a scaled-index mode cannot be justified by these figures. But this is systems code,
which has few array references. However, compilers for scientific languages implement very
sophisticated loop induction optimizations, which tend to eliminate the need for index scaling.
Moreover, the mode is useless for arrays whose components have a size other than 1, 2, 4, or 8 bytes.
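
The equivalence between the Vax scaled-index access and the Mips shift-and-add sequence can be shown with a short model of the address arithmetic. The function names and the sample base address are illustrative only; the comments map each step to the instruction it stands for in the ary[i] example of this section.

```python
def vax_scaled_index(base: int, index: int, size: int = 4) -> int:
    """Effective address of base[index] with the index scaled by the operand size."""
    return base + index * size


def mips_shift_add(base: int, index: int) -> int:
    """The same address built step by step for word-sized elements."""
    rx = index        # lw  rx,i
    rx = rx << 2      # sll rx,rx,2   (multiply by 4)
    rx = rx + base    # add rx,rx,ary
    return rx         # address used by lw u0,0(rx)


ary = 0x1000  # hypothetical base address
for i in range(4):
    print(i, hex(vax_scaled_index(ary, i)), hex(mips_shift_add(ary, i)))
```

The shift only works because the element size is a power of two; for other sizes both machines need an explicit multiply.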

6.2.8.  Local  Conclusions 

We  can  draw  the  following  tentative  conclusions: 

1 .  For  simple  systems  programming,  a  RISC  machine  is  as  effective  as  a  CISC  machine, 
and  potentially  a  lot  faster. 

2.  Leaving  aside  code  reorganization,  it  is  certainly  no  harder  to  generate  code  for  a 
RISC  machine,  and  in  many  respects  it  is  easier.  Moreover,  a  preliminary  study  of  the 
problem  suggests  that  reasonable  code  reorganization  can  be  added  with  little  extra 
effort. 

3.  In  the  main  areas  where  RISC  machines  differ  from  CISC  -  simpler  instructions,  fewer 
address  modes,  no  side  effects  -  the  RISC  design  is  rarely  inferior  and  usually  supe¬ 
rior. 

4.  However,  the  basic  system  software  of  the  machine  must  be  fast  and  efficient. 

In  addition,  the  specific  claims  made  about  the  RISC  machine  under  study  are  corroborated  by  this 
work. 


6.3.  Dynamic  Analysis  of  Compilers 

In this section we describe a brute-force analysis of the code generated by the Mips and Vax
compilers. This section contrasts with the approach taken in section 6.2, which instrumented a compiler and
examined  the  output.  Here,  we  look  at  the  instruction  mix  that  is  output  by  the  compilers  in  response 
to  two  sets  of  input:  a  set  of  integer  application  programs,  and  a  single  large  floating  point  appli¬ 
cation. 

6.3.1.  Instruction  Use  by  Integer  Applications 

The  first  test  we  subjected  the  compilers  to  was  the  compilation  of  a  set  of  integer  application  pro¬ 
grams.  The  three  programs  we  chose  were: 

1.  csh  -  the  Unix  C-Shell.  This  program  is  a  command  interpreter  whose  function  is  to 
scan  user  commands  and  run  system  and  user  programs.  This  program  consists  of 
nearly 16,000 lines of source code and comments.

2.  vi  -  a  Unix  visual  editor.  This  program  is  a  terminal-independent  screen  editor.  It 
provides  all  of  the  standard  editor  functions  in  a  screen  optimal  fashion,  updating  as 
each  change  is  made.  This  program  contains  over  20,000  lines  of  source  code  and 
comments. 

3. uboat - a proprietary authoring language. This program provides a terminal-independent
foundation for writing computer aided courseware, menu systems, and demonstration
drivers. It consists of almost 10,000 lines of source code and comments.

These  three  programs  were  chosen  as  reasonable  representatives  of  integer-based  application  pro¬ 
grams.  By  their  very  nature,  none  of  them  are  highly  compute  intensive,  although  they  do  perform  a 
great  deal  of  data  manipulation.  We  present  the  statistics  for  the  three  programs  together,  rather 
than inundating the reader with individual analyses. In truth, the compiler generated roughly the
same  instruction  mix  for  each  program,  so  we  present  the  average  mix  for  each  compiler. 

6.3.1.1. Analysis of Mips C Compiler

Table  6-13  shows  the  instruction  mix  generated  for  the  three  integer  applications.  We  list  the  actual 
Mips  M/500  instructions  that  were  generated  instead  of  the  high-level  instruction  set.  The  reason  for 
this is that the low-level instructions are the ones that are actually executed; thus, their frequency of
occurrence is much more significant than that of the high-level macro instructions.


Footnote: The instruction counts shown in table 6-13 correspond to the output of the compiler at optimization level 4 (this includes
cross module register allocation and optimization, and requires that all modules be compiled together). We also did not count
the instructions in the run-time libraries or the C initialization or finalization routines.

Instruction     Count   Percent

lb                 13     0.02%
lbu              1778     2.32%
lh               2130     2.77%
lhu               159     0.21%
li               3871     5.04%
lui              1259     1.64%
lw               9650    12.57%
lwc1                6     0.01%
Load            18866    24.57%

sb                930     1.21%
sh               1213     1.58%
sw               5656     7.37%
swc1                2     0.00%
Store            7801    10.16%

cvt.d.s            14     0.02%
cvt.s.d             5     0.01%
cvt.w.d             3
mfc1                3
mfhi               40     0.05%
mflo              103     0.13%
move             6635     8.64%
mtc1               11     0.01%
Shuffle          6814     8.87%

Move            33481    43.61%

add.d               3     0.00%
addiu            7032     9.16%
addu             1098     1.43%
div                85     0.11%
divu                3     0.00%
mul.d               5     0.01%
multu              42     0.05%
sll              1367     1.78%
sllv               10     0.01%
sra               762     0.99%
srav                4     0.01%
srl                 5     0.01%
subu              549     0.72%
Arithmetic      10965    14.34%

Table 6-13: Integer Application Instruction Usage - Mips

The  first  observation  we  make  is  that  there  is  a  distressingly  large  number  of  nop  instructions  in  the 
final  executable  code  -  over  14%  of  the  total  instruction  count  are  nops.  Figure  6-3  displays  the 
instruction  mix  in  graphical  form. 


Although the Mips compiler is generating fairly good code, a more sophisticated code generator
could create programs that run another 10% faster (based on the work described in section 6.2.2,
page 60).*


* It should be noted that while the instruction mix shown in figure 6-13 represents the instruction
frequencies that are present in the executable image, and not necessarily the frequency of
instructions that are executed, we have found that the concurrence between these two figures is
usually very high.

    move    - 43.6%
    compute - 18.1%
    control - 23.9%
    nop     - 14.4%

Figure 6-3: Instruction Distribution - Integer Applications


When  the  nop  instructions  are  excluded,  the  resulting  instruction  mix  follows  the  pattern  shown  in 
figure  6-4. 


    move    - 50.9%
    compute - 21.1%
    control - 27.9%

Figure 6-4: Instruction Distribution - Integer Applications (Minus nops)
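
The step from figure 6-3 to figure 6-4 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `renormalize_without` is our own, and it assumes that the figure 6-4 percentages were obtained by dropping the nop class and rescaling the remaining classes so that they again sum to 100%.

```python
def renormalize_without(mix, excluded):
    """Drop one instruction class and rescale the remaining shares to 100%."""
    kept = {name: share for name, share in mix.items() if name != excluded}
    total = sum(kept.values())
    return {name: round(100.0 * share / total, 1) for name, share in kept.items()}


# Shares read from figure 6-3 (percent of all instructions in the executable).
integer_mix = {"move": 43.6, "compute": 18.1, "control": 23.9, "nop": 14.4}

for name, share in renormalize_without(integer_mix, "nop").items():
    print(f"{name:8s} {share:5.1f}%")
```

Under that assumption the rescaled shares agree with the values printed in figure 6-4.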


This instruction breakdown correlates roughly with the mix shown in figure 6-2 on page 62. In these
examples, however, the ratio of compute instructions to move instructions is somewhat higher than
the standard 15 : 25 mix of a load/store architecture. This is attributable to three possible factors:

1. The applications use more dynamic (i.e., local or register) variables than static variables;
thus, fewer load/store operations are necessary.

2. The applications themselves are performing more computational actions than the standard
program.

3. The Mips compilers are sufficiently well tuned to reduce the total number of load/store
operations that need to be performed, so that a larger share of the generated code is
actual computation. We feel that this is the most likely explanation, since the level 4
optimizer has an interprocedural optimizer and register allocation mechanism (a feature
that is lacking in the BCPL compiler discussed in section 6.2).


The address mode usage by these integer applications is shown in table 6-14. These figures
correspond very closely to those in table 6-5 on page 63. This is not surprising, since all the
application programs are similar in their general nature.

Table 6-14: Address Mode Usage - Integer Applications on Mips


The register usage pattern shown in table 6-15 reveals a number of interesting things:

• The large number of direct references to the stack pointer sp is caused by every routine
moving the stack at entry and exit with an addiu instruction. There are two references
to sp per instruction and two such instructions per routine, so dividing the number of
references by 4 gives 3660 / 4 = 915, the number of procedures defined in the three
applications.

• The even larger number of indirect references to sp is caused by saving and restoring
local registers in each procedure.

• Because the compiler partitions registers into classes (rather than considering them
identical), some observations about register usage are muddied. However, we may
make the following general statements:

    • Temporary registers are allocated on a round-robin basis, and so show a fairly
    uniform distribution of use. Registers t6, t7, and t8 are usually allocated first,
    and thus their reference count is a fraction higher than the other temporaries.

    • The saved registers s0 through s9 are used for local variables and must be
    saved across procedure calls. They are allocated in order and show a steadily
    decreasing frequency of reference from s0 through s9, indicating that there are
    more procedures with a small number of local variables than there are with a
    large number of them.

    • The number of references to the argument registers follows a pattern that
    supports our statements in the footnote on page 40, to the effect that most
    procedures are called with four or fewer parameters. The references to a3 (the
    fourth parameter register) account for less than 3% of the total references to the
    argument registers.

• The assembler temporary register at is used in 8% of all direct register references,
indicating a fairly high percentage of interaction with the assembler reorganizer. It is
likely that a large fraction of these references (and their instructions) could be eliminated
were the compilers to deal directly with the low-level instruction set, instead of the
high-level macro instruction set.

• The kernel registers k0 and k1 are never referenced (not surprisingly). They are used
exclusively by the Mips Unix kernel.
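
The three numeric claims in the list above can be checked directly against the counts in table 6-15. The following is a minimal sketch; the variable names are ours, and the counts are transcribed from the Value (direct reference) column of that table.

```python
# Direct (Value column) reference counts transcribed from table 6-15.
sp_direct = 3660
at_direct = 7164
total_direct = 87075  # printed "Total" of the Value column
arg_refs = {"a0": 7668, "a1": 3870, "a2": 1634, "a3": 372}

# Each procedure adjusts sp twice (entry and exit) with an addiu,
# and each addiu names sp twice (as source and as destination).
procedures = sp_direct // 4

# Share of argument-register references that go to the fourth register.
a3_share = 100.0 * arg_refs["a3"] / sum(arg_refs.values())

# Share of all direct references that go to the assembler temporary.
at_share = 100.0 * at_direct / total_direct

print(f"procedures: {procedures}")
print(f"a3 share of argument references: {a3_share:.1f}%")
print(f"at share of direct references:   {at_share:.1f}%")
```

The results are consistent with the text: 915 procedures, an a3 share below 3%, and an at share of roughly 8%.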


Integer                          Floating-Point
Register     Value     Offset    Register          Total

zero          7431          0    f0                    2
at            7164        514    f1                    0
v0            8732        419    f2                    0
v1            4015        241    f3                    0
a0            7668        290    f4                   10
a1            3870        110    f5                    0
a2            1634         57    f6                   10
a3             372         16    f7                    2
t0            1681         99    f8                    8
t1            1526        136    f9                    0
t2            1355         79    f10                   8
t3            1310         62    f11                   0
t4            1283         65    f12                   0
t5            1168         73    f13                   0
t6            2659        142    f14                   0
t7            2377        116    f15                   0
s0            6329        977    f16                  14
s1            4196        759    f17                   2
s2            3150        265    f18                   8
s3            2193        149    f19                   2
s4            1417         83    f20                  20
s5            1166         58    f21                   4
s6             875         19    f22                   0
s7             708         13    f23                   0
t8            2070        114    f24                   0
t9            1891        112    f25                   0
k0               0          0    f26                   0
k1               0          0    f27                   0
gp            1745       7705    f28                   0
sp            3660       8853    f29                   0
fp/s8          577         11    f30                   0
ra            2823          0    f31                  12
hi              40          0
lo               0          0

Total        87075      21537    Total               102

Table 6-15: Register Usage - Integer Applications on Mips

6.3.1.2.  Comparison  with  VAX  Unix  C  Compiler

When  the  same  integer  application  programs  were  fed  through  the  Berkeley  Vax  C  compiler,  the 
instruction  mix  that  was  observed  is  shown  in  table  6-16. 


Instruction      Count    Percentage

clrb               342       0.8%
clrl               830       1.8%
clrw               293       0.6%
cvtbl              566       1.2%
cvtbw                9       0.0%
cvtdf                2       0.0%
cvtdl                2       0.0%
cvtfd                7       0.0%
cvtlb              361       0.8%
cvtld                6       0.0%
cvtlw              521       1.1%
cvtwb                5       0.0%
cvtwl             1296       2.8%
mcoml               13       0.0%
mnegb                8       0.0%
mnegl               93       0.2%
mnegw               32       0.1%
movab              118       0.3%
moval              649       1.4%
movaq                2       0.0%
movb               219       0.5%
movc3               15       0.0%
movd                 5       0.0%
movl              4689      10.3%
movq                 2       0.0%
movw               137       0.3%
movzbl             107       0.2%
movzbw               3       0.0%
movzwl              26       0.1%
pushab               6       0.0%
pushal            2238       4.9%
pushl             4986      10.9%
Move             17588      38.6%

addd2                4       0.0%
addl2              638       1.4%
addl3              486       1.1%
decb                 1       0.0%
decl               366       0.8%
decw                29       0.1%
divd2                2       0.0%
divd3                1       0.0%
divl2               73       0.2%
divl3              104       0.2%
incb                33       0.1%
incl               640       1.4%
incw                81       0.2%
muld2                5       0.0%
muld3                2       0.0%
mull2              154       0.3%
mull3              158       0.3%
subd3                2       0.0%
subl2              459       1.0%
subl3              530       1.2%
Arithmetic        3768       8.2%

ashl               226       0.5%
bicb2                4       0.0%
bicl2               49       0.1%
bicl3               61       0.1%
bicw2               24       0.1%
bisb2                2       0.0%
bisl2               33       0.1%
bisl3               16       0.0%
bisw2               40       0.1%
extzv               25       0.1%
xorb2                2       0.0%
xorl2                6       0.0%
xorl3                3       0.0%
Logical            491       1.0%

Compute           4259       9.3%

bitb                25       0.1%
bitl                22       0.0%
bitw                71       0.2%
cmpb               420       0.9%
cmpl              2428       5.3%
cmpw               252       0.6%
tstb               641       1.4%
tstl              1756       3.9%
tstw               691       1.5%
Compare           6306      13.8%

acbl                 7       0.0%
aobleq               5       0.0%
aoblss              14       0.0%
casel               30       0.1%
jbc                181       0.4%
jbcc                 9       0.0%
jbs                 83       0.2%
jbss                42       0.1%
jeql              2862       6.3%
jgeq               344       0.8%
jgequ                3       0.0%
jgtr               349       0.8%
jgtru                5       0.0%
jlbc                39       0.1%
jlbs                13       0.0%
jleq               383       0.8%
jlequ                1       0.0%
jlss               422       0.9%
jneq              2684       5.9%
sobgeq              12       0.0%
sobgtr               7       0.0%
Cbranch           7495      16.4%

Ubranch           3971       8.7%

calls             5930      13.0%

Control          23702      52.0%

Total            45549     100%

Table 6-16: Integer Application Instruction Usage - Vax

The slightly lower percentage of move class instructions on the Vax is predictable, since the Vax is
not a load/store architecture. However, the 38.6% figure is still higher than expected. What is most
surprising is the markedly decreased number of compute instructions - a figure we expected to see
increase when the move instructions decreased. The two fractions can be brought closer to par
with each other when it is remembered that many of the addiu instructions on the Mips M/500 are
used to calculate addresses, not actual numeric results.

The  number  of  call  instructions  is  roughly  the  same  on  the  Vax  and  the  Mips  M/500,  although  due  to 
the  decreased  number  of  instructions  required  on  the  CISC  Vax,  they  comprise  a  larger  percentage 
of  the  total.  The  larger  fraction  of  conditional  branches  on  the  Vax  is  compensated  somewhat  on  the 
Mips  M/500  by  breaking  conditionals  into  two  parts,  half  of  which  are  considered  under  booleans. 

What  is  most  interesting,  however,  is  the  under-use  of  the  Vax  instruction  set.  Many  instructions  are 
used  only  0.1%  or  0.2%  of  the  time,  indicating  that  a  large  amount  of  hardware  effort  is  being  spent 
for  a  very  small  software  gain.  When  one  considers  the  frequency  with  which  the  three  operand 
address  mode  is  used  (shown  in  figure  6-5),  we  see  that  many  features  of  the  CISC  instruction  set 
are  simply  not  used  effectively  at  all. 


    1 Operand - 51.1%
    2 Operand - 44.4%
    3 Operand -  4.5%

Figure 6-5: Operand Type - Integer Applications on Vax

To further demonstrate this point, examine table 6-17, which shows the frequency of use of the
various modes available on the Vax. When the Vax was first produced, the indexed addressing
modes were claimed to be highly beneficial in array accessing. However, the indexed addressing
modes are used little more than 0.6% of the time. Other addressing modes are similarly underused.

Address Mode                       Example         Count    Percentage

Immediate                          $270             3701       5.4%
Literal                            $24 (n < 64)     9518      14.0%
Absolute                           $*label             0       0.0%
Absolute Indexed                   $*label[r4]         0       0.0%
Relative                           label           26367      38.9%
Relative Indexed                   label[r4]         392       0.5%
Relative Deferred                  *label            202       0.2%
Relative Deferred Indexed          *label[r4]          0       0.0%
Register                           r3              16619      24.5%
Deferred                           (r3)             1365       2.0%
Deferred Indexed                   (r3)[r4]           78       0.1%
Autoincrement                      (r3)+             394       0.5%
Autoincrement Indexed              (r3)+[r4]           0       0.0%
Deferred Autoincrement             *(r3)+              0       0.0%
Deferred Autoincrement Indexed     *(r3)+[r4]          0       0.0%
AutoDecrement                      -(r3)             776       1.1%
AutoDecrement Indexed              -(r3)[r4]           0       0.0%
Displacement                       24(r3)           7894      11.6%
Displacement Indexed               24(r3)[r4]         25       0.0%
Displacement Deferred              *24(r3)           419       0.6%
Displacement Deferred Indexed      *24(r3)[r4]        11       0.0%

Total                                              67761     100%

Table 6-17: Address Mode Usage - Integer Applications on VAX

In fact, when the use of the address modes is displayed graphically (as in figure 6-6), we see that
94.6% of the address modes used on the Vax are filled by immediate (literal being a subset of
immediate), relative, register, and displacement modes - exactly the modes provided by the Mips
M/500 instruction set. Yet on the Vax each instruction must go through the effort of decoding which
addressing mode is used, even though (for the most part) only 5 of the possible 16 Vax address
modes are ever really used.
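
The 94.6% figure follows from the counts in table 6-17. The sketch below recomputes it; the dictionary keys and the helper name `share` are our own labels, and modes with a zero count are left out because they do not affect the sums.

```python
# Operand counts transcribed from table 6-17 (zero-count modes omitted).
mode_counts = {
    "immediate": 3701,
    "literal": 9518,
    "relative": 26367,
    "relative indexed": 392,
    "relative deferred": 202,
    "register": 16619,
    "deferred": 1365,
    "deferred indexed": 78,
    "autoincrement": 394,
    "autodecrement": 776,
    "displacement": 7894,
    "displacement indexed": 25,
    "displacement deferred": 419,
    "displacement deferred indexed": 11,
}

# The five modes that the text singles out as the ones the Mips M/500 also provides.
COMMON_MODES = ("immediate", "literal", "relative", "register", "displacement")


def share(counts, modes):
    """Percentage of all operand references that use one of the given modes."""
    return 100.0 * sum(counts[m] for m in modes) / sum(counts.values())


indexed = [m for m in mode_counts if m.endswith("indexed")]
print(f"five common modes: {share(mode_counts, COMMON_MODES):.1f}%")
print(f"all indexed modes: {share(mode_counts, indexed):.2f}%")
```

The transcribed counts sum to the printed total of 67761, and the five common modes account for 94.6% of it.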

Table  6-18  shows  another  interesting  artifact  of  the  Berkeley  C  compiler  (that  serves  to  show  off  the 
Mips  compiler  as  a  better  example  of  compiler  writing). 

Table 6-18: Register Usage - Integer Applications on Vax

In the Berkeley compiler, local variables that are explicitly declared to be of type register are
allocated starting at register r11, working downwards. If a variable is not declared register, it is
allocated on the stack. This explains the decreasing frequency of register references from r11 to
r6.

As an additional artifact, r0 (and r1) are the function return registers, while registers r4 and r5 are
rarely allocated, due to their interaction with the move instructions.*

Registers r12 and r13 are the frame and argument pointers and are referenced almost exclusively
as pointers to the stack. The stack pointer r14 is referenced both indirectly and directly.

6.3.2.  Instruction  Use  by  Floating-Point  Applications 

In this next test case, we gave the compiler a large floating-point application. We used the SPICE
program, a large circuit-simulation program written in Fortran, consisting of over 18,000 lines of
dense, ugly code and comments. We regret that only a single program was used in this test;
however, the instruction count generated by this program nearly equaled that of the combined integer
applications, so we feel our choice was not a bad one. We realize that it is difficult to compare
Fortran and C compilers, since the semantics of the source languages differ so greatly. However,
since the code generator and optimizer in both the Vax and the Mips programming environments are
common to both languages, we feel that there is sufficient similarity between the two compilers to
warrant a broad comparison.

6.3.2.1.  Analysis  of  Mips  Fortran  Compiler 

Table 6-19 shows the instruction mix generated by the Mips compiler for the SPICE program. As in
section 6.3.1.1, we have listed only the low-level Mips M/500 instructions that were generated by the
level 4 optimizer. We have not counted the Fortran run-time library routines, or the Fortran
initialization or finalization code.


* The DEC compilers do not suffer from these aberrations of register allocation behavior.

Instruction      Count    Percentage

cvt.d.s            192
cvt.d.w             69
cvt.s.d            184
cvt.w.d             37
mfc1                37
mfhi                 4
mflo                58
mov.d              634
mov.s               46
move              2675
[illegible]       2108
Shuffle           6044       8.0%

Move             41361      54.3%

bc1f               510       0.7%
bc1t               335       0.4%
beq               1147       1.5%
bgez                32       0.0%
bgtz                27       0.0%
blez                45       0.1%
bltz                60       0.1%
bne                811       1.1%
CBranch           2967       3.9%

b                 1555       2.0%
break               40       0.1%
[illegible]        179       0.2%
UBranch           1774       2.3%

jal               3120       4.1%
Call              3120       4.1%

Control           5053       6.6%

nop               5718       7.5%

Total            76024     100%

Table 6-19: Floating-Point Application Instruction Usage - Mips

The table of values for the floating-point performance of the compiler differs from the integer
performance (shown in figures 6-3 and 6-4) in a number of ways. First, there are fewer nop instructions
and a higher percentage of move instructions. This is shown graphically in figure 6-7.

    move    - 54.3%
    compute - 27.0%
    control -  6.6%
    nop     -  7.5%

Figure 6-7: Instruction Distribution - Floating-Point Application

The decreased number of nop instructions is somewhat surprising, given the increased number of
load class instructions. The substantially decreased control operations, however, may offset this
statistic.*

The decreased number of nop instructions does not imply that floating-point applications generate
better code than integer applications, nor that Fortran generates better code than C. It is simply the
nature of this particular program, which has a control structure that did not require the insertion of
many nop instructions. The reader should be aware, however, of "hidden" delays in the
floating-point computations. While most Mips M/500 instructions are executed in a single clock
cycle, the floating-point instructions are not, and they require synchronization between the Mips
M/500 and the floating-point co-processor. The number of null operations that the Mips M/500
executes during floating-point operations is therefore much higher than these tables of statistics
would suggest.

Removing  the  nop  instructions  from  consideration,  we  see  the  instruction  mix  shown  in  figure  6-8. 


    move    - 61.7%
    compute - 30.7%
    control -  7.5%

Figure 6-8: Instruction Distribution - Floating-Point Applications (Minus nops)


* The load and jump/branch instructions have a delay slot following them that must be filled. If the
assembler reorganizer is unable to move instructions around the load or jump/branch, it fills the
delay slot with a nop instruction.

This chart shows a much higher percentage of move class instructions than seen in figure 6-4, only a
small fraction of which (8.5% of the total) are actual register-to-register movement. The dominating
factor is load instructions. We suspect that this is a language and application dependency - the
program makes heavy use of Fortran common, a factor which effectively defeats interprocedural
register allocation by making register slaving of the values of common variables very difficult. Thus,
the compiler is forced to load variables before each use. This is not a fault of the compiler, or of
RISC architectures, but is a result of the antiquated nature of the Fortran language. The heavy use
of global variables, a practice highly discouraged by most modern software engineering dogmas,
exacts its price in program performance. This would also be the case if Pascal or another modular
language used global variables with the frequency of Fortran. The Mips Fortran compiler could
be strengthened somewhat by placing the addresses of common variables, or the address of the start
of common blocks, into globally allocated registers. This would eliminate some of the lui and addiu
instructions, which are currently used for accessing common and passing parameters by reference.

Figure 6-8 also shows a much higher percentage of compute instructions, with a decreased
percentage of control operations. We feel that this is another language and application artifact -
Fortran is basically a "straight-line" language, with few deviations from the top-to-bottom execution
model. The SPICE circuit simulator similarly has few decisions to make - most of the calculations,
though elaborate, are rather straightforward.

The frequency of address mode usage is shown in table 6-20. The pattern of usage is very
similar to that shown for integer applications in table 6-14. The differences are that (obviously) a
larger fraction of floating-point registers are used in the SPICE benchmark, and there is a slight
increase in the use of displacement mode. This latter effect is probably caused by two factors - the
large number of variables stored in common, and the fact that Fortran passes parameters to
routines by reference instead of by value. Other than this, the addressing mode patterns are fairly
consistent.


Address Mode        Count    Percentage

Immediate           21662      13.9%
Absolute             3078       1.9%
Register            64787      41.7%
Displacement        27601      17.7%
Floating-point      39188      25.2%

Total              155316     100%

Table 6-20: Address Mode Usage - Floating-Point Application on Mips

The register usage patterns are shown in table 6-21. The frequency of use of many of the registers
differs greatly from that for the integer applications shown in table 6-15. This is caused by a
number of factors:

• The assembler/reorganizer temporary register at is used more frequently in offset
mode. This is due largely to the fact that common variables are addressed relative to
the base of their respective common regions, and global address references are
translated by the assembler reorganizer into an offset from at.

• Temporary registers are allocated on a round-robin basis, and so show a fairly uniform
distribution of use. Registers t6, t7, and t8 are usually allocated first, and thus their
reference count is a fraction higher than the other temporaries.

• Floating-point registers are used with much greater frequency (the SPICE circuit
simulator is a floating-point program). The registers show an interesting pattern of use,
though:

    • The odd numbered registers are used much less frequently than are the even
    numbered ones. This is because double precision floating-point numbers are
    stored in two registers (and referenced by the low order register of the pair). The
    vast majority of floating-point variables in SPICE are double precision variables.

    • The register allocation algorithm for floating-point variables does not appear to be
    the same round-robin scheme that is used for temporary registers. Instead,
    floating-point registers show a roughly exponentially decreasing frequency of use
    from register f4 to f30.

• Subroutine parameters are passed in registers a0 through a3, but the pattern seen in
table 6-15 does not show up here. This is because double-precision floating-point
variables are passed in two argument registers (instead of one for integer variables), and
so the usage curve decays more slowly.

• The kernel registers k0 and k1 are never referenced (not surprisingly). They are used
exclusively by the Mips Unix kernel.

• The saved registers s0 through s9 are used for local variables and must be saved
across procedure calls. They are allocated in order, and show a steadily decreasing
frequency of reference from s0 through s9, indicating that there are more procedures
with a small number of local variables than there are with large numbers of them.

Integer                          Floating-Point
Register     Value     Offset    Register          Total

zero          3306          0    f0                 2860
at            9825       5655    f1                  636
v0            3179       1093    f2                 1772
v1            1549       1077    f3                  411
a0            3607        382    f4                 4285
a1            2431        339    f5                 1495
a2            1885        201    f6                 4175
a3            1010        152    f7                 1473
t0            1147        119    f8                 4042
t1            1426        151    f9                 1373
t2            1410        183    f10                4293
t3            1409        174    f11                1493
t4            1667        144    f12                1405
t5            1565        152    f13                 260
t6            3334        422    f14                 961
t7            3254        440    f15                 171
s0            2671        940    f16                 882
s1            2438        820    f17                 215
s2            1709        455    f18                1059
s3            1439        287    f19                 304
s4            1093        207    f20                1140
s5             905        433    f21                 383
s6             930        114    f22                 863
s7             978         21    f23                 256
t8            3251        387    f24                 751
t9            3126        436    f25                 233
k0               0          0    f26                 556
k1               0          0    f27                 156
gp            1026        708    f28                 494
sp            2018      12005    f29                 140
fp/s8          701        102    f30                 406
ra             494          2    f31                 245
hi               4          0
lo               0          0

Total        64787      27601    Total             39188

Table 6-21: Register Usage - Floating-Point Application on Mips
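
The even/odd observation about the floating-point registers can be quantified from table 6-21. This is a small sketch only; the list `fp_refs` is our transcription of the floating-point column of that table, indexed by register number.

```python
# Reference counts for f0 through f31, transcribed from table 6-21.
fp_refs = [
    2860, 636, 1772, 411, 4285, 1495, 4175, 1473,
    4042, 1373, 4293, 1493, 1405, 260, 961, 171,
    882, 215, 1059, 304, 1140, 383, 863, 256,
    751, 233, 556, 156, 494, 140, 406, 245,
]

even_total = sum(fp_refs[0::2])  # f0, f2, ..., f30
odd_total = sum(fp_refs[1::2])   # f1, f3, ..., f31

print(f"even registers: {even_total}")
print(f"odd registers:  {odd_total}")
print(f"even/odd ratio: {even_total / odd_total:.2f}")
```

With these counts the even-numbered registers receive a little over three times as many references as the odd-numbered ones, which fits the explanation that most SPICE variables are double precision values named by the even register of a pair.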

6.3.3.  Comparison  with  VAX  Unix  Fortran  Compiler 

The  SPICE  benchmark  was  also  given  to  the  Vax  Fortran  compiler  for  comparison  purposes.  The 
data  on  instruction  usage  is  shown  in  table  6-22. 




Instruction      Count    Percentage

addd2  addd3  addf3  addl2  addl3
divd2  divd3  divl2  divl3
muld2  muld3  mulf3  mull2  mull3
subd2  subd3  subf3  subl2  subl3
Arithmetic                  23.4%

Logical

Compute                     24.2%

Cbranch
Ubranch
calls
Control           9579      25.1%

Move             19278      50.6%

Total            38095     100%

Table 6-22: Floating-Point Application Instruction Usage - Vax

As with the Mips instruction mix in table 6-19, we see a decrease in the number of control type
instructions, and an increase in the number of arithmetic instructions. Again, we see the unusually
high number of move instructions, even though the Vax is not a load/store architecture.

What is most interesting is the number of compare instructions in the Vax instruction mix. There are
no compare instructions on the Mips; instead, the instructions used to perform conditional branches
contain the operands to be compared. On the Vax, two instructions need to be executed to perform
most conditional branches: a compare and a branch. Rarely, if ever, are the condition codes used.
Thus, even though the Vax has a more complex instruction set, the Mips M/500 has the mechanism
for performing a conditional branch in a single instruction.*

Also as before, many instructions are underused. Instructions such as mnegl, which moves the
negative of a number into a register (saving 3 bytes of instruction), are used less than one-tenth of
one percent of the time. It would be better to load a negative number directly, or to load a positive
one and then negate it, than to waste the processor floorspace to implement the function in a single
instruction that is rarely used.

The  address  mode  usage  on  the  Vax  by  the  SPICE  simulator  is  shown  in  table  6-23.  With  the 
exception  of  the  10%  use  of  the  Relative  indexed  mode,  the  distribution  of  address  modes  is  similar 
to  that  shown  in  table  6-17. 


Address Mode                       Example         Count    Percentage

Immediate                          $270              870       1.1%
Literal                            $24 (n < 64)     5873       8.0%
Absolute                           $*label             0       0.0%
Absolute Indexed                   $*label[r4]         0       0.0%
Relative                           label           16323      22.3%
Relative Indexed                   label[r4]        7339      10.0%
Relative Deferred                  *label              0       0.0%
Relative Deferred Indexed          *label[r4]          0       0.0%
Register                           r3              18394      25.2%
Deferred                           (r3)              233       0.3%
Deferred Indexed                   (r3)[r4]           14       0.0%
Autoincrement                      (r3)+               0       0.0%
Autoincrement Indexed              (r3)+[r4]           0       0.0%
Deferred Autoincrement             *(r3)+              0       0.0%
Deferred Autoincrement Indexed     *(r3)+[r4]          0       0.0%
AutoDecrement                      -(r3)             431       0.5%
AutoDecrement Indexed              -(r3)[r4]           0       0.0%
Displacement                       24(r3)          22217      30.4%
Displacement Indexed               24(r3)[r4]        284       0.3%
Displacement Deferred              *24(r3)           944       1.2%
Displacement Deferred Indexed      *24(r3)[r4]        18       0.0%

Total                                              72930     100%

Table 6-23: Address Mode Usage - Floating-Point Application on Vax


* On those occasions when the Mips assembler reorganizer must expand a conditional to two or three
instructions, the sltx and xor instructions are used. The total of these instructions does not come
close to the number of compare instructions used on the Vax. Apparently, then, the Mips M/500 does
conditional branches more efficiently than the Vax.

The extra high use of the Relative Indexed mode is either because of Fortran's parameter passing
mechanism or its access to common arrays. Other than this, we make the same observation that we
made for the integer applications: of the 16 addressing modes available on the Vax, only 5 are ever
really used (basically the same addressing modes that are available on the Mips M/500). The CPU
could thus be substantially simplified without any major loss in efficiency of compiled code.

Looking at table 6-24, we see the same symptoms as we found in table 6-18, except that in this
case, register allocation is even worse. Of the registers r6 to r11, only r11 is ever really used.


Register Usage by Class

Register     Value    Pointer     Index     Total

r0           11351        127      4383     15861
r1            1228         91       555      1874
r2            2087          2       487      2576
r3               1          0         1         2
r4              87          0         0        87
r5               0          0         0         0
r6             227          4        16       247
r7             280          8        18       306
r8             391         11        45       447
r9             839         14        65       918
r10           1580          6       261      1847
r11            193      17150      1824     19167
r12              0       1277         0      1277
r13              0       5020         0      5020
r14            130        431         0       561
r15              0          0         0         0

Total        18394      24143      7655     50192

Table 6-24: Register Usage - Floating-Point Application on Vax


This underuse is predominantly a failing in the Berkeley Fortran compiler, and not inherent to the
VAX. If the Berkeley compiler had an adequate register allocation algorithm, we would see a much
better pattern of register use. As it is, however, some registers are overused, and some are badly
underused.

6.3.4.  Local  Conclusions

It is interesting to note that the Mips compilers generated 73 out of the 85 possible instructions* for
the Mips M/500 on these four programs. The use of 85% of the possible instructions attests to the
validity of this test as a fair coverage of the instruction spectrum of a machine. When we look at the
Berkeley VAX compilers, we find that of the 146 possible instructions, 111 (75%) were generated for
our test programs.

If we examine instead the use of the entire instruction set by the compilers, we find that the Mips
compilers use 54% of the total Mips M/500 instruction repertoire (73 out of 135 instructions), while
the Berkeley compilers could only use 34% of the Vax instruction set (111 out of 323 instructions).
The fact that, in both comparisons, a lower percentage of instructions was used by the Vax attests to
the overcomplicated nature of the Vax CISC architecture.

If we compare the number of instructions that were generated, we find that the Mips program (with
152786 instructions) used only 1.82 times as many instructions as the Vax (with 83644 instructions).
When the byte count is compared (a much more valid measure), the Mips uses 611144 bytes versus
the VAX's 474224 bytes, so the code size increase is actually only 1.29:1. This is because Mips
instructions are always 4 bytes long, while Vax instructions vary in length depending on the
addressing modes used. Since the Mips M/500 is far more than 1.29 times faster than the Vax, we may
assume that the penalty of more instructions being required to perform a task, which is incurred by
moving to a RISC architecture, is more than offset by the increased performance a RISC architecture
provides.
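
The ratios quoted above follow directly from the counts. The short C sketch below recomputes
them; the function names are illustrative only, and the average Vax instruction length is a
derived figure that the measurements do not state directly.

```c
/* Counts reported in this section for the four test programs. */
#define MIPS_INSTRUCTIONS          152786L
#define VAX_INSTRUCTIONS            83644L
#define VAX_CODE_BYTES             474224L
#define MIPS_BYTES_PER_INSTRUCTION      4L  /* every Mips instruction is 4 bytes */

/* Total Mips code size in bytes, from the fixed instruction length. */
long mips_code_bytes(void)
{
    return MIPS_INSTRUCTIONS * MIPS_BYTES_PER_INSTRUCTION;
}

/* Ratio of Mips to Vax instruction counts. */
double instruction_ratio(void)
{
    return (double)MIPS_INSTRUCTIONS / (double)VAX_INSTRUCTIONS;
}

/* Ratio of Mips to Vax code size in bytes. */
double byte_ratio(void)
{
    return (double)mips_code_bytes() / (double)VAX_CODE_BYTES;
}

/* Average Vax instruction length in bytes (derived, not measured directly). */
double vax_average_instruction_bytes(void)
{
    return (double)VAX_CODE_BYTES / (double)VAX_INSTRUCTIONS;
}
```

The byte ratio is smaller than the instruction ratio because the average Vax instruction in
these programs is longer than 4 bytes.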

From  the  three  analyses  that  we  have  performed  (static  compiler  analysis,  an  instrumented  example, 
and  dynamic  compiler  performance),  the  choice  of  a  RISC  architecture  has  won  out  over  a  CISC 
architecture.  Each  of  the  analyses,  considered  independently  or  collectively,  shows  that  it  is  easier 
for  a  compiler  to  generate  code  for  a  RISC  architecture,  and  that  that  code  executes  more  efficiently. 
One  might  be  tempted  to  look  at  the  results  from  the  Vax  and  conclude  that  the  Vax  compilers  need 
to  be  made  more  robust.  A  better  conclusion,  however,  is  that  the  instructions  and  addressing 
modes  that  are  not  used  by  the  Vax  compilers  are  simply  not  needed. 


* Possible instructions are those that the compiler can generate, not those that the Mips M/500 can
execute (see section 6.1.1.2).
7. General Drawbacks of Assembler-only Code Reorganization

The  Mips  compiler  suite  uses  an  assembler  reorganizer  (described  in  chapter  3)  to  translate  from  a 
high-level  assembly  language  to  the  Mips  M/500  native  machine  code.  The  assembler  reorganizer 
also  serves  the  function  of  making  sure  that  the  restrictions  of  the  instruction  pipeline  are  observed. 
These  restrictions  include  a  one-cycle  delay  following: 

•  a  branch  or  jump  instruction 

•  a  load  from  memory  before  the  value  is  available 

•  a  double  precision  move  operation 

•  a co-processor control operation

and a two-cycle delay following:

•  a move from the lo or hi register.

It  is  possible,  without  knowing  the  semantics  of  a  program,  to  use  value  tracking  to  determine  when 
an  instruction  will  modify  the  source  of  a  subsequent  instruction.  The  Mips  assembler  reorganizer 
uses  this  information  to  move  instructions  forward  in  the  execution  order  to  fill  in  the  delay  slots 
required  by  the  pipeline  (see  section  3.1  for  details).  This  eliminates  a  large  number  of  delay  slots 
that  would  otherwise  have  to  be  filled  with  nop  instructions.  However,  it  is  our  contention  that  a 
reorganizer  belongs  in  the  compiler,  not  in  the  assembler. 
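
A minimal sketch of the register side of value tracking follows. The RegUse record and the
register_conflict function are hypothetical names introduced for illustration; the actual data
structures of the Mips reorganizer are not described in this report. Memory operands are
deliberately ignored here.

```c
#include <stdint.h>

/* Hypothetical instruction summary: one bit per general register (0-31). */
typedef struct {
    uint32_t reads;   /* registers used as sources       */
    uint32_t writes;  /* registers written as destinations */
} RegUse;

#define REG(n) ((uint32_t)1 << (n))

/* Returns 1 if `later` must stay after `earlier` because of register use,
   0 if the two instructions may be exchanged as far as registers go. */
int register_conflict(RegUse earlier, RegUse later)
{
    if (earlier.writes & later.reads)  return 1; /* later needs the new value  */
    if (earlier.reads  & later.writes) return 1; /* later overwrites a source  */
    if (earlier.writes & later.writes) return 1; /* final value would change   */
    return 0;
}
```

An instruction is a candidate for a delay slot only if this test reports no conflict with each
instruction it is moved past; a real reorganizer must also reason about memory.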

Clearly, the assembler must verify that the pipeline constraints are satisfied. However, the Mips
assembler also translates the high-level instructions into the Mips M/500 native machine-level
instructions, sometimes expanding simple instructions into a sequence of instructions. While this
makes it easier to write code in assembly language by hand, it has deleterious effects on compilers.
We therefore assert that the proper place for a reorganizer is in the compiler, and not in a
post-processing assembler.

To support our claim, we cite the following seven issues (which will be explained in greater detail in
later sections):

1. The code generator knows a lot more about aliasing* than the assembler. Although it
is difficult to detect aliasing in a compiler, it is even more difficult to detect it in the
language-context-free environment that is presented to the assembler. Since a
reorganizer must consider aliasing effects, it is better to put a reorganizer in the compiler.

2. The compiler understands the alignment of variables. It can know when it is not
necessary to reload the top 16 bits of an address** by ensuring that the top 16 bits are
the same for two variable components (e.g., of a Fortran complex type). The
assembler could make similar deductions from carefully placed .align directives, but
seems not to do so - and anyway, the compilers do not generate .align directives
other than for word alignment.

3. Since the assembler reorganizer may need to perform some intermediate calculations
in the Mips M/500 native instruction set to implement the high-level instructions that
are given to it, the assembler must reserve a temporary register for this purpose (i.e.,
at). This leaves one less register for the compiler to use, and often results in the
needless recalculation or reloading of temporary values that a compiler could store in
one of its registers.

4. The assembler assumes only a single base register (i.e., gp). However, it is often
much more efficient to allow the compiler to allocate multiple base registers - for
example, one for a given routine, or one for read-only data, etc.

5. With a knowledge of the reorganization requirements of the hardware, a compiler can
make intelligent decisions about delaying arithmetic calculations. The assembler
reorganizer must be very pessimistic about moving arithmetic instructions forward or
backward, for fear of affecting numeric results. With the expression semantics available to
it, the compiler is much more able to move instructions to avoid nop delays.

6. The assembler cannot easily reverse the effects of code hoisting (either α-motion or
ω-motion). In this case, compiler optimization effectively reduces the strength of the
final assembly code.

7. Since the Mips assembler is, in effect, a macro-assembler, the final peephole
optimization performed by the compilers is defeated by the macro expansion performed by the
assembler. The assembler reorganizer must then supplement the compilers'
optimization with a peephole optimization of its own, but this is less efficient than doing all
optimization in the compiler.

* Aliasing is the condition under which two address expressions reference the same memory location.

** The Mips M/500 can only store the low 16 bits of an address in an instruction. If a 32-bit address
must be generated, it must be done in two instructions.


We will now present a number of simple examples that demonstrate the problems cited above.
These examples are all somewhat contrived, and are designed to illustrate the problem in as small a
space as possible. Thus, the code fragments themselves may look somewhat unreasonable. The
reader is assured, however, that real-life examples that trigger these same symptoms exist in
profusion.


7.1.  Alignment  Problems  in  the  Reorganizer 

On  the  Mips  M/500,  all  addresses  are  stored  as  32-bit  quantities.  However,  an  instruction  that 
references  a  global  variable  must  first  load  the  upper  16  bits  of  the  address  of  the  variable  with  an 
lui  instruction,  followed  by  an  instruction  that  references  the  low  16-bits  of  the  address.  Very  often, 
the  assembler  cannot  know  the  alignment  of  two  variables  relative  to  each  other.  Consequently,  it 
must  load  the  upper  16-bits  of  the  address  of  each  global  variable  each  time  it  references  one  of 
them  (since  it  is  unable  to  determine  whether  the  variables  are  in  the  same  16-bit  address  space  -  a 
fact  which  may  change  between  assembly  and  link  time,  especially  if  the  variables  are  declared  in 
different  modules). 

When a variable is larger than a single word (e.g., a Fortran complex variable or a C structure),
the compiler can align the variable on a known boundary in such a way that the upper 16 bits of the
address of all of the components of the variable are guaranteed to be the same. Then the upper
16 bits need only be loaded once for a sequence of accesses.
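
The guarantee can be checked mechanically. The sketch below assumes the usual Mips convention
that the low half of an address is applied as a signed 16-bit displacement, so the upper half is
rounded before it is loaded; the function names are illustrative and are not part of any Mips tool.

```c
#include <stdint.h>

/* Upper half that a lui must supply when the low half is used as a signed
   16-bit displacement (assumed convention). Addresses within 0x8000 of the
   top of the 32-bit address space are not handled. */
uint32_t upper_half(uint32_t address)
{
    return (address + 0x8000u) >> 16;
}

/* Returns 1 if every byte of an object of `size` bytes (size >= 1) starting
   at `base` can be reached from one loaded upper half. Because upper_half
   never decreases as the address grows, checking the first and last byte
   is sufficient. */
int shares_upper_half(uint32_t base, uint32_t size)
{
    return upper_half(base) == upper_half(base + size - 1);
}
```

An 8-byte object on an 8-byte boundary always passes this test, while the same object placed on a
4-byte boundary can straddle two upper halves.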
            .data
            .align  3           # align on 2**3 byte boundary
    cmplx:  .word   0,0
            .text
    align:
            lb      $2, cmplx
            lb      $3, cmplx+1
            lh      $4, cmplx+2
            lw      $5, cmplx+4

Figure 7-1: Alignment Problem - Assembler Source Code

Examine figure 7-1. Note that the variable cmplx is a double-word quantity aligned on a
double-word boundary. The top 16 bits should be the same for all of the above operand addresses
(cmplx, cmplx+1, etc.). Even if the assembler/reorganizer cannot recognize the alignment of the two
words comprising the double word, at least the first three instructions can share one load of $at.


    align:
    0x0:    3c010000    lui     at,0x0
    0x4:    80220028    lb      v0,40(at)
    0x8:    3c010000    lui     at,0x0
    0xc:    80230029    lb      v1,41(at)
    0x10:   3c010000    lui     at,0x0
    0x14:   8424002a    lh      a0,42(at)
    0x18:   3c010000    lui     at,0x0
    0x1c:   8c25002c    lw      a1,44(at)
    0x20:   00000000    nop

Figure 7-2: Alignment Problem - Mips M/500 Code

As shown in figure 7-2, the register $at is loaded afresh for every operand, quite needlessly.* The
excuse that the individual loads might reference words that are in different 64 KB segments is
fallacious, since the .align directive ensures that this is not the case. A compiler would be aware of
the alignment of every object, while the assembler reorganizer is not. For unaligned objects, a
compiler could load the true start address into a base register. For aligned objects (such as that
shown in figure 7-1), it could load the top 16 bits into a base register once, and not reload it each
time. In this example, the code could be reduced to a little more than half of its original size.

* The lui will not necessarily load the value 0. Instead, the linker will fill in the value at link time
with the correct base address.


7.2. Problems with Aliasing

The Mips M/500 assembler reorganizer knows nothing about the sources or targets of load and store
operations. Thus, when it is dealing with registers that are pointing to data (i.e., based address
mode), it must assume that the registers are aliased - that is, it must assume that since two registers
may contain the same value, they may point at the same data item. Therefore, the assembler
reorganizer must avoid reorganizing around load/stores that involve based address mode.**


    # *ptr1 = *ptr2;
            lw      $12, 0($8)
            sw      $12, 0($9)
    # *ptr3 = *ptr4;
            lw      $13, 0($10)
            sw      $13, 0($11)

Figure 7-3: Aliasing Problem - Assembly Source

When faced with the problem of generating code for a copy from one set of pointers to another, a
compiler might generate the code shown in figure 7-3. The code is straightforward and concise - the
variables are loaded using based address mode and stored the same way. However, consider the
Mips M/500 code that the assembler reorganizer generates.


    0x0:    8dcf0000    lw      t4,0(t0)
    0x4:    00000000    nop
    0x8:    af0f0000    sw      t4,0(t1)
    0xc:    8f280000    lw      t5,0(t2)
    0x10:   00000000    nop
    0x14:   ad280000    sw      t5,0(t3)

Figure 7-4: Aliasing Problem - Mips M/500 Code

As  shown  in  figure  7-4,  the  Mips  M/500  code  that  is  generated  contains  a  nop  instruction  between 
each  load  and  store.  This  satisfies  the  pipeline  delay  that  is  required  before  the  values  become  valid 
in  the  registers.  If  the  compiler  is  given  the  true  instruction  set  of  the  machine  to  operate  with, 
instead  of  a  high-level  assembly  language,  these  nop  instructions  can  be  avoided. 

The  assembler  reorganizer  is  not  filling  in  these  nop  slots  because  it  cannot  tell  whether  the  target  of 
the  store  operation  is  the  same  as  (i.e.,  if  it  is  aliased  to)  the  source  of  the  second  load.  A  compiler 
could  determine  whether  aliasing  was  a  concern,  and  if  it  determined  that  it  was  not,  it  could  rewrite 
the  code  as  in  figure  7-5. 


    0x0:    8dcf0000    lw      t4,0(t0)
    0x4:    8f280000    lw      t5,0(t2)
    0x8:    af0f0000    sw      t4,0(t1)
    0xc:    ad280000    sw      t5,0(t3)

Figure 7-5: Aliasing Problem Corrected

Notice that the delay slots have been filled by reorganizing the code. Since the first store does not
affect the second load, that load may be moved in front of the first store. Since the
assembler/reorganizer is unaware of the presence or absence of any aliasing, it is unable to perform
this function - a strong argument in favor of putting the reorganizer function in the compiler, which
can do much stronger analysis of aliasing. For example, in a strongly typed language, two pointers
with different base types cannot be aliases. A compiler would know this, since it knows the types;
the assembler cannot know this, since it has no type information.
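
The hazard can be reproduced in C. In the sketch below (the function names are ours),
copy_in_order follows the source order of figure 7-3 and copy_reordered follows the order of
figure 7-5. The two agree when the four pointers are distinct, and differ when ptr4 is an alias
for ptr1.

```c
/* Source order of figure 7-3: each load is followed by its store. */
void copy_in_order(int *ptr1, const int *ptr2, int *ptr3, const int *ptr4)
{
    *ptr1 = *ptr2;
    *ptr3 = *ptr4;
}

/* Order of figure 7-5: both loads are done before either store. */
void copy_reordered(int *ptr1, const int *ptr2, int *ptr3, const int *ptr4)
{
    int t4 = *ptr2;   /* first load                          */
    int t5 = *ptr4;   /* second load, moved above the store  */
    *ptr1 = t4;       /* first store                         */
    *ptr3 = t5;       /* second store                        */
}
```

If ptr4 and ptr1 name the same location, copy_in_order stores the new value of *ptr1 through
ptr3, while copy_reordered stores the old one. This is the case that the reorganizer must
assume is possible.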


** In fact, for addresses that are declared external, the assembler/reorganizer must not reorganize
around load/stores that involve relocatable or absolute addresses. This is because it has no
guarantee that the two addresses will not be the same at link time (i.e., two labels referring to the
same data location).

We would like to note that the reorganizer is pretty good about moving code that is unaffected by
aliasing. For example, had a set of arithmetic operations followed the load/stores, using different
registers as sources and destinations, the assembler reorganizer would have moved them upward to
fill in the delay slots. There are numerous cases, however, where this sort of action will be
precluded.


7.3.  Delaying  Calculations  to  Avoid  No-Ops 

Because the Mips M/500 assembler reorganizer cannot know the code generator's intent when
scanning a piece of assembly code, it must be very pessimistic about reorganizing code. Even when it
knows all of the "come-from" locations, it will not reorganize around a label. We assert that a
compiler, armed with the semantics of the source language (and thus mindful of the programmer's
intentions) can, with much greater confidence, rearrange the assembly language that it produces.

    int g1, g2, g3;
    int h1, h2, h3;

    delay()
    {
        g1 = g2 + g3;
        h1 = h2 + h3;
    }

Figure 7-6: Example of Assembly Rearrangement - C Source

Figure 7-6 shows a simple example of a routine that adds two pairs of global variables and places
the results in a third pair. The assembly language that is generated (figure 7-7) is perfectly
reasonable - the values g2 and g3 are loaded from memory into registers, added together, and stored
in g1. Then the values h2 and h3 are loaded into registers, added together, and stored in h1.

    delay:
    # g1 = g2 + g3;
            lw      $14, g2
            lw      $15, g3
            addu    $24, $14, $15
            sw      $24, g1
    # h1 = h2 + h3;
            lw      $25, h2
            lw      $8, h3
            addu    $9, $25, $8
            sw      $9, h1

Figure 7-7: Example of Assembly Rearrangement - Assembly Output

For a machine that is not pipelined, this is perfectly reasonable behavior on behalf of the compiler.
However, recalling the pipeline restrictions, the assembler must provide a one-cycle delay following
each load (the lw instructions) before the value in the register becomes valid. Thus, the delay slot of
the first load is filled with the second load, but the delay slot for the second load must be observed
before the addu instruction can be executed.
    delay:
    0x0:    8f8e0000    lw      t6,0(gp)
    0x4:    8f8f0000    lw      t7,0(gp)
    0x8:    00000000    nop
    0xc:    01cfc021    addu    t8,t6,t7
    0x10:   af980000    sw      t8,0(gp)
    0x14:   8f990000    lw      t9,0(gp)
    0x18:   8f880000    lw      t0,0(gp)
    0x1c:   00000000    nop
    0x20:   03284821    addu    t1,t9,t0
    0x24:   af890000    sw      t1,0(gp)

Figure 7-8: Example of Assembly Rearrangement - Mips M/500 Code

As shown in figure 7-8, the assembler reorganizer is unable to move any instructions downward to fill
either of these delay slots. Examining either the assembly code or the Mips M/500 code, however,
shows that the code can be reorganized in a better way. Instructions can be moved forward and
backward to fill in the delay slots. Figure 7-9 shows this hand-optimized reorganization.


    delay:
    0x0:    8f8e0000    lw      t6,0(gp)
    0x4:    8f8f0000    lw      t7,0(gp)
    0x8:    8f990000    lw      t9,0(gp)
    0xc:    01cfc021    addu    t8,t6,t7
    0x10:   8f880000    lw      t0,0(gp)
    0x14:   af980000    sw      t8,0(gp)
    0x18:   03284821    addu    t1,t9,t0
    0x1c:   af890000    sw      t1,0(gp)

Figure 7-9: Example of Assembly Rearrangement - Optimized Mips M/500 Code

Notice that the load of h2 has been moved backward to fill the delay slot required after the load of
g3. The store to g1 has been moved forward to fill in the delay slot after the load of h3. The net
result is that, while the code in figure 7-9 performs exactly the same function as the code in figure
7-8, it is 20% smaller (8 instructions instead of 10). We do not claim that a 20% increase in speed
can be obtained with this optimization technique. However, as was explained in section 6.3.1.1, over
14% of the code generated by the Mips C compiler was nop instructions - a figure which could be
substantially reduced with this and other optimizations.*
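
The nop counts in figures 7-8 and 7-9 can be reproduced with a very small model of the load
delay. The Insn record, the two instruction tables, and the count_load_nops function below are
hypothetical and cover only the one-cycle delay after a load; branch, co-processor, and lo/hi
delays are ignored.

```c
#include <stddef.h>

/* Register numbers for the names used in figures 7-8 and 7-9. */
enum { NONE = -1, T0 = 8, T1 = 9, T6 = 14, T7 = 15, T8 = 24, T9 = 25, GP = 28 };

/* Hypothetical, simplified instruction record. */
typedef struct {
    int is_load;   /* 1 for lw, 0 otherwise       */
    int dest;      /* register written, or NONE   */
    int src1;      /* first register read         */
    int src2;      /* second register read, or NONE */
} Insn;

/* Instruction order as emitted by the compiler (figure 7-7). */
const Insn compiler_order[8] = {
    {1, T6, GP, NONE}, {1, T7, GP, NONE}, {0, T8, T6, T7}, {0, NONE, T8, GP},
    {1, T9, GP, NONE}, {1, T0, GP, NONE}, {0, T1, T9, T0}, {0, NONE, T1, GP}
};

/* Hand-reorganized order (figure 7-9). */
const Insn hand_order[8] = {
    {1, T6, GP, NONE}, {1, T7, GP, NONE}, {1, T9, GP, NONE}, {0, T8, T6, T7},
    {1, T0, GP, NONE}, {0, NONE, T8, GP}, {0, T1, T9, T0}, {0, NONE, T1, GP}
};

/* Counts the nops needed so that no instruction reads the destination of
   the load immediately before it. */
size_t count_load_nops(const Insn *code, size_t n)
{
    size_t nops = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        const Insn *load = &code[i];
        const Insn *next = &code[i + 1];
        if (load->is_load && load->dest != NONE &&
            (next->src1 == load->dest || next->src2 == load->dest)) {
            nops++;
        }
    }
    return nops;
}
```

Under this model the compiler's order needs two nops (10 instructions in all) and the
hand-reorganized order needs none (8 instructions).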


7.4.  Macro  Expansion  Defeating  Peephole  Optimization 

It was often observed in the Berkeley compilers that the so-called optimization phase was not a true
optimizer, but rather a neatener. This is basically all that a peephole optimizer is able to do - neaten
the generated code somewhat. The Mips M/500 assembler reorganizer suffers from this same
problem. After the compiler has done a good job of optimizing for the Mips virtual machine, the
assembler reorganizer expands each of these instructions into the corresponding Mips M/500
instructions, effectively messing up the optimization. The peephole optimizer in the assembler
reorganizer can then only "neaten up" after it has rumpled the previously elegant code.

* A random sampling of nop instructions (shown in section 6.2.2, page 60) found that over 70% of the
nops in a given system application could be eliminated. There is probably room, therefore, for
approximately another 10% increase in speed in program execution by performing better nop
elimination.

Consider that a compiler has the conditional expression

    ((a <= b) and (c > 5)) or ((a > b) and (c = 0))

for which to generate code. It would certainly be reasonable for the compiler to calculate a <= b
and negate the result (and thus have the result of both a <= b and a > b). One reasonable way of
doing this on the Mips would be as shown in figure 7-10.

sle  $8, $4, $5 

not  $9, $8 

Figure  7-10:  Assembler  Reorganizer  Defeating  Optimization  -  Assembler  Source 

When this code is presented to the assembler reorganizer, the code that it generates is changed
somewhat, as in figure 7-11. Instead of the sle that the compiler requested, the
assembler/reorganizer has changed it into an slt (with reversed operands), followed by an xori.
This is then followed by the compiler-requested not.

    0x0:    00a4402a    slt     t0,a1,a0
    0x4:    39080001    xori    t0,t0,0x1
    0x8:    01004827    nor     t1,t0,zero

Figure 7-11: Assembler Reorganizer Defeating Optimization - Mips M/500 Code

Since the slt instruction just sets the low bit of t0, the exclusive-or with the constant 1 is a
complementation (i.e., a not), which is immediately complemented again by the nor instruction.
What results is that the compiler has generated what it believes to be good code, but the final effect
of the assembler reorganizer is to generate poor code, because the macro expansion follows the
low-level optimization.
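
The cancellation is easy to confirm by emulating the three native instructions. The sketch below
is illustrative only; it models slt, xori, and nor on 32-bit values, with a in a0 and b in a1.
Note that nor complements all 32 bits, so the expanded sequence and a single slt agree in the low
bit rather than in the whole register.

```c
#include <stdint.h>

/* slt rd,rs,rt: rd = 1 if rs < rt (signed comparison), else 0. */
static uint32_t slt(int32_t rs, int32_t rt)
{
    return rs < rt ? 1u : 0u;
}

/* The three native instructions of figure 7-11. */
uint32_t expanded_sequence(int32_t a, int32_t b)
{
    uint32_t t0 = slt(b, a);   /* slt  t0,a1,a0   : low bit = (b < a)       */
    t0 = t0 ^ 1u;              /* xori t0,t0,0x1  : low bit = (a <= b)      */
    uint32_t t1 = ~(t0 | 0u);  /* nor  t1,t0,zero : complement of all bits  */
    return t1;
}

/* What a code generator aware of the native instruction set could emit
   if only the low bit of (a > b) is needed: the slt alone. */
uint32_t single_slt(int32_t a, int32_t b)
{
    return slt(b, a);
}
```

For every pair of operands, the low bit of expanded_sequence equals single_slt, so the xori and
nor pair contributes nothing to that bit.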


7.5. Drawbacks of Reserving a Temporary Register for the Assembler

Because  the  assembler  reorganizer  must  rewrite  the  code  that  is  given  to  it  by  the  compiler,  it  often 
must  add  instructions  into  the  assembly  stream  to  overcome  the  shortcomings  of  the  Mips  M/500 
native  instruction  set.  Very  often  it  needs  to  use  a  temporary  register  to  hold  some  intermediate 
values.  This  register  is  at,  and  is  reserved  by  the  assembler  reorganizer  for  its  own  use. 

* The assembly language that is available to the user and to the compilers is not the actual machine
language used by the Mips M/500. The assembler reorganizer translates high-level instructions into
low-level Mips M/500 machine instructions.

We assert that this use of a register unavailable to the compiler is a mistake for at least two reasons:

1. The compiler is denied the use of this register, and so has fewer registers to allocate.
Although this is a minor point with 25 other registers to use,* it does reduce the
efficiency of the machine somewhat.

2. There are times when it is more efficient to store two temporary values, but the
assembler is constrained to building work-arounds.

We feel the latter reason is the more important, and we demonstrate our reasons in the following
example. Consider the source code shown in figure 7-12. All that the code is doing is incrementing
two (global) variables by 1.

    x := x + 1;
    y := y + 1;

Figure 7-12: Temporary Register Problem - High Level Source

A  compiler  that  is  somewhat  aware  of  the  reorganization  requirements  of  the  target  machine  might 
generate  code  of  the  form  shown  in  figure  7-13.  The  code  is  interleaving  the  loads  and  adds  to  avoid 
the  delay  following  a  load  from  memory  required  by  the  pipeline. 


            .data
            .align  2
    x:      .word   0
    y:      .word   0
            .text
    tempreg:
            lw      $2,x
            lw      $3,y
            add     $2,$2,1
            add     $3,$3,1
            sw      $2,x
            sw      $3,y

Figure 7-13: Temporary Register Problem - Assembly Code

As shown in figure 7-14, the assembler reorganizer takes the interleaved code and messes it up
somewhat. Because x and y are not directly addressable, the assembler reorganizer must build the
addresses of each 16 bits at a time. Because of the interleaving generated by the compiler, it cannot
use one register for both x and y. But because it has only one register available to it (i.e., at), it
must load and reload that one register.

A code generator could use two temporary registers, one each for the top 16 bits of the address of x
and y, and so save the second two lui instructions.** Once again, macro expansion is inhibiting or
defeating other optimizations.
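
The cost of having only one temporary can be estimated with a small model. The sketch below is
an assumption-laden illustration: it treats the upper half of each global symbol as distinct (as
the assembler must), gives each access a small non-negative symbol id, and replaces the least
recently used temporary when a new upper half is needed. The function name and the replacement
policy are ours, not those of any Mips tool.

```c
#include <stddef.h>

#define MAX_TEMPS 8

/* Counts the lui instructions needed for a sequence of global accesses when
   `temps` registers (1..MAX_TEMPS) may hold upper halves of addresses. */
size_t count_lui(const int *symbols, size_t n, size_t temps)
{
    int    held[MAX_TEMPS];      /* symbol whose upper half each temp holds */
    size_t last_use[MAX_TEMPS];  /* 0 = never used, else access index + 1   */
    size_t loads = 0;

    if (temps == 0 || temps > MAX_TEMPS)
        return n;                /* out of range: assume one lui per access */

    for (size_t r = 0; r < temps; r++) {
        held[r] = -1;
        last_use[r] = 0;
    }

    for (size_t i = 0; i < n; i++) {
        size_t slot = 0;         /* least recently used temp seen so far */
        int hit = 0;
        for (size_t r = 0; r < temps; r++) {
            if (held[r] == symbols[i]) {   /* upper half already loaded */
                slot = r;
                hit = 1;
                break;
            }
            if (last_use[r] < last_use[slot])
                slot = r;
        }
        if (!hit) {
            held[slot] = symbols[i];       /* models one lui instruction */
            loads++;
        }
        last_use[slot] = i + 1;
    }
    return loads;
}
```

With the access order of figure 7-13 (x, y, x, y), the model gives four lui instructions for one
temporary and two for two temporaries; with the stores reversed (x, y, y, x), one temporary
needs three.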


* Although the Mips M/500 has 32 registers, 7 are reserved. These are the zero register, at, k0, k1,
gp, sp, and ra.

** Note that the assembler reorganizer could certainly save one lui by reversing the order of the
stores. This is safe, since it is evident that x and y do not overlap (i.e., there is no aliasing problem
to be reckoned with here, so the ordering can be altered).
    tempreg:
    0x0:    3c010000    lui     at,0x0
    0x4:    8c220028    lw      v0,40(at)
    0x8:    3c010000    lui     at,0x0
    0xc:    8c23002c    lw      v1,44(at)
    0x10:   20420001    addi    v0,v0,1
    0x14:   20630001    addi    v1,v1,1
    0x18:   3c010000    lui     at,0x0
    0x1c:   ac220028    sw      v0,40(at)
    0x20:   3c010000    lui     at,0x0
    0x24:   ac23002c    sw      v1,44(at)

Figure 7-14: Temporary Register Problem - Mips M/500 Code


7.6.  Shortcomings  of  Using  a  Single  Global  Pointer 

In the current implementation, the Mips compilers load global variables using relocatable or
indexed-relocatable address modes. On the Mips M/500, this is translated by the assembler
reorganizer to a sequence of instructions that always performs a lui instruction. This is required,
since the linker may have relocated the target address so that the upper 16 bits are significant.

To circumvent this problem somewhat, the assembler reorganizer provides two data segments in
addition to the Unix standard of .data and .bss segments.* These are the .sdata and .sbss
segments, which are equivalent to the .data and .bss segments, respectively, except that they are
addressed via the global pointer gp.

The  gp  register  is  loaded  by  the  program  prelude,  and  the  initial  value  is  specified  by  the  linker.  The 
problem  with  this  scheme  is  that  it  limits  the  compilers  somewhat.  It  would  be  better  to  allow  the 
compilers  to  make  intelligent  decisions  on  register  allocation  based  on  variable  usage  rather  than 
restricting  them  by  the  requirements  of  the  assembler.  Two  specific  examples  of  how  compiler 
performance  could  be  increased  are: 

•  In  Fortran,  the  compiler  could  allocate  a  global  pointer  to  point  at  the  beginning  of  a 
common  block.  Currently,  the  compiler  must  always  use  relocatable  address  expres¬ 
sions,  which  require  two  Mips  M/500  instructions  to  fetch  an  address.  Using  based 
address  mode  (with  the  compiler  allocated  register)  requires  only  one  native  instruction. 

•  In C, array accesses are performed using the indexed relocatable address mode, which
requires three Mips M/500 instructions. Array accesses could be simplified into two
native instructions by allocating a base register at compile time for those arrays which
are accessed heavily in a routine.

The problem with these optimizations is that currently, they are 'difficult.' The compiler views as its
target architecture the Mips pseudo-machine, when in fact it should be generating code for the Mips
M/500 native machine. On the pseudo-machine, based address mode is no more complicated than
relocatable mode (whereas on the real machine, they are quite different). For this and other
optimizations to be feasible, the assembler reorganizer should be eliminated (or at least simplified),
and the compilers should target the native Mips M/500, not the Mips pseudo-machine.


* The .bss segment is for uninitialized data, which, under Unix, defaults to being initialized to zero.
The .data segment is for all explicitly initialized data.
7.7. Arithmetic Optimizations on Native Hardware

To  save  execution  time,  the  assembler  reorganizer  will  substitute  a  multiply  with  a  sequence  of  shifts 
and  adds  whenever  possible  (see  section  3.2.1).  This  "optimization,"  however,  has  a  strictly  pee¬ 
phole  effect  in  that  it  can  sometimes  cause  a  program  to  run  slower  overall. 

    extern w, x, y, z;

    mult()
    {
        x = 465*y + 1890*z;
    }

Figure 7-15: Optimistic Approach to Multiplication - C Source


Consider  the  source  code  fragment  shown  in  figure  7-15.  In  this  simplistic  example,  a  variable  is 
loaded  with  the  sum  of  two  products.  Since  the  compiler  only  knows  about  the  instruction  set  of  the 
Mips  pseudo-machine,  it  generates  the  instruction  sequence  shown  in  figure  7-16. 


    #   5   x = 465*y + 1890*z;
            lw      $14, y
            mul     $15, $14, 465
            lw      $24, z
            mul     $25, $24, 1890
            addu    $8, $15, $25
            sw      $8, x

Figure 7-16: Optimistic Approach to Multiplication - Assembler Source


This  is  an  entirely  reasonable  thing  for  the  compiler  to  do,  since  it  has  been  told  that  a  multiply  is  a 
single  instruction.  As  shown  in  section  3.2.1,  however,  a  single  multiply  can  be  expanded  to  a  large 
sequence  of  shifts  and  adds.  In  this  case,  both  multiplications  are  by  constant  values,  so  this  is 
exactly  what  happens.  As  shown  in  figure  7-17,  the  first  multiply  is  translated  into  6  instructions,  and 
the  second  into  7  instructions. 


    0x0:    8f8e0000    lw      t6,0(gp)
    0x4:    8f980000    lw      t8,0(gp)
    0x8:    000e78c0    sll     t7,t6,3
    0xc:    01ee7823    subu    t7,t7,t6
    0x10:   000f7880    sll     t7,t7,2
    0x14:   01ee7821    addu    t7,t7,t6
    0x18:   000f7900    sll     t7,t7,4
    0x1c:   01ee7821    addu    t7,t7,t6
    0x20:   0018c900    sll     t9,t8,4
    0x24:   0338c823    subu    t9,t9,t8
    0x28:   0019c880    sll     t9,t9,2
    0x2c:   0338c823    subu    t9,t9,t8
    0x30:   0019c900    sll     t9,t9,4
    0x34:   0338c821    addu    t9,t9,t8
    0x38:   0019c840    sll     t9,t9,1
    0x3c:   01f94021    addu    t0,t7,t9
    0x40:   03e00008    jr      ra
    0x44:   af880000    sw      t0,0(gp)

Figure 7-17: Optimistic Approach to Multiplication - Mips M/500 Code

Since the assembler reorganizer is trying to discourage the use of the actual Mips M/500 multiply
instruction,  it  has  taken  efficient  code  and  translated  it  into  code  that  is  far  less  efficient  than  it  could 
be.  The  algorithm  to  convert  a  multiply  into  shifts  and  adds  is  quite  simple,  and  could  be  placed  in 
the  compiler  instead  of  the  assembler  reorganizer.  The  extra  information  needed  to  make  this  a 
worthwhile  investment  (i.e.,  the  semantics  of  the  arithmetic  operations  and  their  interactions  with 
other  variables)  also  resides  in  the  compiler. 


    #   5   x = 31*(15*y + 63*z);
            lw      $14, y
            mul     $15, $14, 15
            lw      $24, z
            mul     $25, $24, 63
            addu    $8, $15, $25
            mul     $9, $8, 31
            sw      $9, x


Figure  7-18:  A  Better  Approach  to  Multiplication  -  Assembler  Source 


Since the compiler knows the semantics of the expression, it can look for common factors in the
constant multipliers and rewrite the expression into what at first appears to be a less optimal
form, as shown in figure 7-18. This form includes not two, but three, multiplications, which seems
to be a worse implementation. However, when this code is fed to the assembler reorganizer (which
will convert the multiplications by constant values to shifts and adds), we get what is shown in
figure 7-19.


    0x0:    8f8e0000    lw      t6,0(gp)
    0x4:    8f980000    lw      t8,0(gp)
    0x8:    000e7900    sll     t7,t6,4
    0xc:    01ee7823    subu    t7,t7,t6
    0x10:   0018c980    sll     t9,t8,6
    0x14:   0338c823    subu    t9,t9,t8
    0x18:   01f94021    addu    t0,t7,t9
    0x1c:   00084900    sll     t1,t0,4
    0x20:   01284823    subu    t1,t1,t0
    0x24:   03e00008    jr      ra
    0x28:   af890000    sw      t1,0(gp)


Figure  7-19:  A  Better  Approach  to  Multiplication  -  Mips  M/500  Code 


In this case, the multiplication by 15 is translated into 2 instructions, the multiplication by 63 into 2
instructions, and the multiplication by 31 into 2 instructions. The net result is that, by writing a more
"pessimistic" assembly source, we can reduce the actual instruction count of the arithmetic from 14
instructions to 7 - a reduction of 50%.


While the results may not always be this spectacular, if the compiler were armed with knowledge of
the real Mips M/500 assembly language, instead of relying on the assembler reorganizer to translate
from the Mips pseudo instruction set, the compiler could generate more efficient code. In general,
reducing arithmetic expressions to their simplest factored form can allow the compiler to generate
tighter code. For most architectures, this is a pessimization, not an optimization. Due to the expense
of the multiply instruction, however, reducing the number of true multiplies by increasing the
number of shifts and adds pays off.
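
The shift-and-add expansions discussed in this section can be checked with a short C sketch. The function names are ours (hypothetical, not part of the Mips tool chain); each body mirrors, step by step, the instruction sequence listed for one constant multiply in figure 7-17, and the helper covers the shift-and-subtract pairs of the form used in figure 7-19. Unsigned arithmetic is assumed, to match the wraparound behavior of addu and subu. Note that 31*(15*y + 63*z) expands to 465*y + 1953*z, so the sketch checks each constant multiply on its own rather than the equality of the two source expressions.

```c
#include <stdint.h>

/* 465*y, following the six-instruction sequence of figure 7-17. */
uint32_t mul465(uint32_t y)
{
    uint32_t t = y << 3;   /*   8y : sll  */
    t -= y;                /*   7y : subu */
    t <<= 2;               /*  28y : sll  */
    t += y;                /*  29y : addu */
    t <<= 4;               /* 464y : sll  */
    t += y;                /* 465y : addu */
    return t;
}

/* 1890*z, following the seven-instruction sequence of figure 7-17. */
uint32_t mul1890(uint32_t z)
{
    uint32_t t = z << 4;   /*   16z : sll  */
    t -= z;                /*   15z : subu */
    t <<= 2;               /*   60z : sll  */
    t -= z;                /*   59z : subu */
    t <<= 4;               /*  944z : sll  */
    t += z;                /*  945z : addu */
    t <<= 1;               /* 1890z : sll  */
    return t;
}

/* (2^k - 1) * v in two operations: one shift and one subtract.
 * Covers constants such as 15 (k=4), 31 (k=5) and 63 (k=6); k must be < 32. */
uint32_t mul_pow2_minus1(uint32_t v, unsigned k)
{
    return (v << k) - v;
}
```

Constants one less than a power of two are the cheapest case for this technique, which is why factoring an expression toward such constants can shorten the generated code.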
8. Validation of MIPS Pascal Compiler

This chapter describes the results of the Pascal validation suite* as applied to the Mips M/500. The
validation suite tests the Pascal compiler against the BS 6192:1982 "Specification for Computer
Programming Language Pascal"** and reports any discrepancies. In this chapter, we list those
discrepancies, along with our evaluation of the ramifications. The discrepancies are listed under four
categories: portability, conformance, incorrectly generated code, and extensions. In all cases, the
section numbers listed in the discrepancy reports refer to the section numbers in BS 6192:1982.
Please note that this chapter refers only to those failures which the validation set was able to
discover; it does not report on those tests which passed correctly. It should also be noted that the Mips
Pascal compiler is a level 0 implementation of Pascal, which is to say that it does not support
conformant arrays. According to the Standard, this is an acceptable reduction in compiler strength,
although we feel that conformant array support is still desirable.

According to Mips Inc., their Pascal compiler is an implementation of ANSI standard Pascal
(ANSI/IEEE 770X3.97-1983). As such, there will be slight differences between it and the BS
6192:1982 Pascal. The differences, however, are far fewer in number than we found as
discrepancies in the following sections.


8.1.  Portability 


This  section  lists  those  features  under  which  the  Mips  Pascal  compiler  deviates  from  the  standard  in 
a  way  that  may  affect  program  portability.  Generally,  these  deviations  are  expressed  as  extensions 
to  the  language. 


Section 

Symptom  and  Comments 

6.1.2-2

The  unrestricted  words  otherwise,  return,  separate,  subtype,  double,  and  cobol  are 
reserved  words  in  Mips  Pascal. 

In  the  case  of  double,  return  and  otherwise,  MIPS  Pascal  is  providing  what  we  feel  to  be 
needed  extra  functionality  to  the  language.  The  other  additional  reserved  words  serve  other 
functions.  In  any  event,  this  is  a  legal  language  extension,  provided  it  is  documented. 

6.1.6-5

The  Mips  Pascal  compiler  allows  labels  to  exceed  the  range  of  1  ..9999. 

The  Pascal  standard  states  that  labels  must  be  restricted  to  the  range  of  1  ..9999.  By  allowing 
labels  to  exceed  that  limitation,  programs  developed  on  the  Mips  Pascal  compiler  may  have 
portability  problems.  In  practice,  however,  it  is  unlikely  that  a  programmer  would  use  a 
sufficient  number  of  labels  to  make  a  simple  translation  unfeasible. 

6.1.6-6

The Mips Pascal compiler allows labels to contain alphabetic characters.

The Pascal standard states that labels must be numeric. By allowing a label to contain
alphabetic characters, the Mips Pascal compiler presents a possible portability problem.


* The Pascal validation suite was obtained from Software Consulting Services, 3162 Bath Pike, Nazareth, PA 16046. All
test programs from the validation suite are copyrighted by A. H. J. Sale and the British Standards Institution, 1982.

** Also known as ISO 7185, and available from the British Standards Institute, 2 Park Street, London W1A 2BS, England.

The  Mips  Pascal  compiler  allows  string  variables  to  be  stored  in  ordinary  (unpacked)  arrays. 

The Pascal standard specifically states that character strings are of type packed
array [1..n] of char. By allowing unpacked arrays to hold strings, the Mips Pascal compiler
presents a possible portability problem. In general, however, this is a rather simple
addition to the language, and can be worked around easily enough.

6.1.7-11

The Mips Pascal compiler allows for the null string.

The Pascal standard states that a character string is a sequence of characters surrounded by
apostrophes, hence there can be no null string. Although this introduces a portability problem,
we do not feel that it presents any real issue.

6.1.8-5

The Mips Pascal compiler allows the expression

    i := 10div j;

to pass without error.


The  expression  (notice  the  missing  space  character)  is  clearly  unambiguous,  even  though  it  is 
in  violation  of  the  standard.  We  do  not  expect  this  deviation  to  be  of  any  consequence. 

6.2.1-8
6.2.1-9
6.2.1-10

The Mips Pascal compiler allows for declarations outside of the standard-specified order, and
for multiple declarations of any given type.

The Standard requires that declarations be in the order:

1. label

2. const

3. type

4. var

5. procedure/function

Since  many  Pascal  compilers  allow  these  deviations,  we  feel  that  this  is  of  little  consequence. 

6.2.2-8

This program fragment compiles successfully, even though it is in violation of the Standard:

    const
      red = 1;
      violet = 2;

    procedure ouch;
    const
      m = red;
      n = violet;
    type
      a = array[m..n] of integer;
    var
      v : a;
      color : (blue, red, indigo, violet);
    begin
      v[1] := 1;
      color := red
    end;

The Pascal Standard requires that the defining-point of an identifier shall precede all applied
occurrences of that identifier, with the exception of pointer-type declarations. The scope of an
identifier is its whole region, which, in most cases, is a block. The rules prohibit a reference to
an outer identifier of the same spelling preceding the defining-point. The test includes two
exactly similar violations of the rules in the use of the identifiers red and violet in the
declarations of m and n. The Mips Pascal compiler is treating the declarations in a top-down
manner, instead of considering them in a block-oriented manner. This particular error is very
hard for a 1-pass front end to get right.

6.2.2-12

The Mips Pascal compiler allows an applied occurrence of a type to be in the same scope as a
field designator of the same name:

    type
      rec = record
        ptr : ^fred;
        fred : integer
      end;
      fred = rec;

This deviation from the standard presents a significant portability problem. Should a programmer
take advantage of this feature, it could be rather difficult to undo its use when attempting
to port a program written on the Mips M/500.

6.3-2
6.3-4
6.3-5
6.7.2.2-5

The Mips Pascal compiler allows characters and booleans to be signed, for example:

    const
      dot = '.';
      plusdot = + dot;

or:

    const
      truth = true;
      plustruth = + truth;

While it is not anticipated that a programmer would use this feature, the failure of the Mips
Pascal compiler to catch this error suggests that hidden program flaws may pass through the
compiler undetected.

6.3-6

This program deviates because constants must not appear in their own definition:

    const
      ten = 10;

    procedure p;
    const
      ten = ten;
    begin
    end;

The Standard explicitly forbids a constant to appear in its own definition. In this program, the
definition ten = ten is in the scope of the second use of ten and, accordingly, is in error.
While it is not anticipated that a programmer would use this feature, the failure of the Mips
Pascal compiler to catch this error suggests that hidden program flaws may pass through the
compiler undetected.

6.3-7

The Mips Pascal compiler allows the value nil to be used in the constant definition part:

    const
      nothing = nil;

This deviation allows the programmer to define a synonym for nil. For portability purposes,
this presents only a small problem, since a global textual substitution will solve the compilation
problems.
6.3-9

By allowing this example to compile, the Mips Pascal compiler deviates from the Standard
since expressions cannot appear in a constant-definition:

    const
      linelength = 80;
      lineoflo = linelength+1;

The const-part contains definitions of identifiers in terms of simple constants. Standard Pascal
does not permit expressions to be used, even if their values are compile-time determinable.
The authors have opposing viewpoints on this restriction. Should it present a portability
problem, however, it is easily worked around.

6.4.1-3

The Mips Pascal compiler allows the use of a type in the same scope as its definition:

    type
      x = integer;

    procedure p;
    type
      x = record
        y : x
      end;
    begin
    end;

In this case, the definition of the component y in the record x is of type integer, although the
scope of the type x is the same as the declaration. Because the Mips Pascal compiler allows
this to compile, it suggests that the compiler does not place a type in the symbol table until it is
fully defined. While it is not anticipated that a programmer would use this feature, the failure of
the Mips Pascal compiler to catch this error suggests that hidden program flaws may pass
through the compiler undetected. It will also present a nasty portability problem if this feature
is used.

6.4.3.2-5

Strings must have a subrange of integers as an index-type. The following fragment compiles
without error:

    type
      color = (red, blue, yellow, green);
      cll = blue..green;
    var
      s : packed array[cll] of char;
    begin
      s := 'ABC';
    end.

It is incorrect to have a subrange of an enumerated-type as the index-type, even if the ord of
the lower bound is one. As with other examples of this type, we feel it unlikely that a Pascal
programmer will use this feature of the Mips Pascal compiler. However, in this case we feel
that by allowing this code to pass through without error, the Mips Pascal compiler is allowing
other, perhaps undetected, errors to pass through by equating some instances of sets with
integers.
6.4.3.3-18

This test deviates, since all values of a tag-type of a record must appear as case-constants.

    type
      color = (pink, red, green, blue, yellow);
      colored = record
        case c : color of
          pink : (p : array [1..2] of color);
          red : (r : array [1..3] of color);
          blue, yellow : (b : array [1..5] of color);
      end;

This deviation is another of little consequence. The requirement that all of the values of the
tag-type appear as case-constants is primarily for completeness. The actual value of the tag
(in this case c) is not used to access the variant part, so assigning green to c will not cause a
range violation error on the variant part.


6.4.3.5-13

This test deviates, since the component-type of a file-type should not include a file-type.

    var
      f1 : file of text;

The Pascal predeclared entity text is a file-type. By allowing this program fragment to
compile, the Mips Pascal compiler is introducing a possible portability problem. It appears
easy enough to change in the source program, however, to merit little concern.


6.4.3.5-14

This test deviates for the same reason as 6.4.3.5-13.

    type
      rec = record
        f1 : text;
        f2 : file of char;
      end;
    var
      f3 : file of rec;

In this example, the compiler is essentially allowing a file of file of char to be a legal
type. This will generally not compile on other Pascal compilers.

6.4.5-12

The following fragment compiles without error:

    if 'CAT' < 'HOUND' then

The Pascal Standard permits compatibility only between string-types having the same number
of components, while the Mips Pascal compiler allows compatibility between different string
types. This is a nice extension to the language, although finding and correcting all such
instances in a program to be ported could prove to be a difficult venture.
6.4.5-16

This test violates the type rules for relational-operators using sets as operands.

    type
      BType = set of boolean;
      PType = packed set of false..true;
    var
      flag : boolean; B : BType; P : PType;
    begin
      B := [true, false];
      P := [true];
      flag := (B >= P);  { B,P, incompatible }

A relational-operator between values of a set type can either have compatible operands or be
of the same canonical set-of-T type. In this instance, the T is not the same (one packed, the
other unpacked). The Mips Pascal compiler makes no distinction between packed and unpacked
datatypes, so this is of little consequence on the Mips machine. However, serious
difficulties could arise in porting.

6.4.6-4

This test deviates, since assignment of reals to integers is not permitted.

    var
      r : real;
      i : integer;
    begin
      r := 6.0;
      i := r
    end.

The Pascal Standard allows assignment of integers to reals, but not reals to integers. To
perform this latter assignment, the program writer must use the explicit built-in functions trunc
or round (the MIPS Pascal compiler is performing an implicit trunc operation). While this
feature is of little consequence on the Mips, chasing down all instances of this feature in a
program to be ported could prove harrowing.

6.4.6-6

The Mips Pascal compiler allows this program to compile and execute without error:

    type
      rekord = record
        f : text;
        a : integer
      end;
    var
      record1 : rekord;
      record2 : rekord;
    begin
      record1.a := 1;
      rewrite(record1.f);
      rewrite(record2.f);
      record2 := record1;
      writeln(' DEVIATES...6.4.6-6')
    end.

Structured-types containing a file component should not be assigned to each other. The
Pascal Standard states that the two types T1 and T2 (in determining assignment compatibility)
must not be a structured-type with a file component. This feature of the Mips Pascal compiler
seems to be a little more threatening regarding the portability issue.
6.5.4-4

This program deviates because a function-identifier cannot be used as a pointer-variable:

    type
      ptr = ^integer;
    var
      p : ptr;

    function f : ptr;
    var
      p : ptr;
    begin
      new(p);
      f := p;
      f^ := 10
    end;

    begin
      p := f;
      writeln(p^);
    end.

The Mips Pascal compiler takes a short-cut and treats a function-identifier as a local variable
when it appears on the left-hand side of an assignment. This is illegal according to the
Standard, and presents a noticeable portability problem.

6.6.1-3

This program shows that a procedure call is incorrectly bound to the wrong defining occurrence.

    procedure p;
    begin
      writeln(' OUTER PROCEDURE')
    end;

    procedure q;
      procedure qq;
      begin
        p
      end;

      procedure p;
      begin
        writeln(' INNER PROCEDURE')
      end;
    begin
      qq
    end;

    begin
      q
    end.


Since  the  applied  occurrence  is  before  the  defining  occurrence  (in  qq),  the  program  deviates. 
The  Mips  Pascal  compiler  should  issue  a  compile  time  error  indicating  that  the  procedure  p  is 
not  declared  at  the  time  of  its  use.  Instead,  it  uses  the  outer  procedure,  even  though  the  scope 
of  the  inner  procedure  overrides  it.  If,  within  the  procedure  q,  the  procedure  p  is  declared  to  be 
of  type  forward,  the  inner  procedure  is  called,  alluding  to  the  linear  creation  of  the  symbol 
reference  table  within  the  compiler. 
6.6.1-4

This program shows another example of a procedure binding to the wrong occurrence:

    var
      i : integer;

    procedure p;
    begin
      i := ord( )
    end;

    function ord(c:char): integer;
    begin
      ord := - maxint
    end;

    begin
      p;
    end.

This test uses a standard function rather than nested procedures. We feel that it is unlikely
that a programmer will redefine a built-in function in this manner. However, the Mips Pascal
compiler should nonetheless issue an error message for this program.

6.6.3.2-1

The assignment compatibility rules prohibit a type with a file component being used as a value
parameter.

    type
      f = record
        x : integer;
        y : text
      end;
    var
      v : f;

    procedure p(q : f);
    begin
      rewrite(q.y)
    end;

    begin
      v.x := 1;
      p(v);
    end.

Since a file is conceptually an area on a secondary storage medium, it cannot have a
"value". By allowing a file to be passed as a value parameter, the Mips Pascal compiler
introduces a severe portability problem.

6.6.3.3-4

This test deviates, since an actual variable parameter shall not denote a field which is the
selector of a variant-part.

    type
      shape = (triangle, rectangle);
      figure = record
        area : real;
        case s : shape of
          triangle : (base, height : real);
          rectangle : (side1, side2 : real)
      end;
    var
      ptr : ^figure;

    procedure findarea(var s : shape);
    begin
      case s of
        triangle :
          ptr^.area := (ptr^.base*ptr^.height)/2;
        rectangle :
          ptr^.area := ptr^.side1*ptr^.side2
      end
    end;

    begin
      new(ptr);
      ptr^.s := rectangle;
      ptr^.side1 := 3;
      ptr^.side2 := 4;
      findarea(ptr^.s);  { illegal }
      if ptr^.area = 12 then
        writeln(' VAR PARAMETER PASSING')
      else
        writeln(' VAR PARAMETER DEVIANCE')
    end.


This deviation opens the door to some major problems. What the Mips Pascal compiler is
allowing the user to do is the following: A variant record is used with one part of the variant in
one place in the program. While this variant part is in use, the variant is passed by reference
to another routine, which then has the liberty to change the selector field - without "advising"
the caller of the routine. Although the Mips Pascal compiler is getting the value of area right
(i.e., it is 12), it is providing a major loophole in the Pascal type checking rules (effectively
permitting Fortran equivalencing or the unconstrained C union operator in a language which
forbids this type of construct).
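
The loophole can be illustrated with a small C sketch of the same situation. This is our own hypothetical analogue, not code from the validation suite: the struct, enum, and function names are invented for illustration. A callee that receives only a pointer to the selector can change it, leaving the tag out of step with the data actually stored in the union, and nothing in the language objects.

```c
/* Hypothetical C analogue of the variant record in test 6.6.3.3-4. */
enum shape { TRIANGLE, RECTANGLE };

struct figure {
    double area;
    enum shape s;                          /* selector (tag) field */
    union {
        struct { double base, height; } tri;
        struct { double side1, side2; } rect;
    } u;
};

/* The callee sees only the tag, by reference, and is free to overwrite it. */
void retag(enum shape *s)
{
    *s = TRIANGLE;
}

/* Returns 1 if, after the call, the tag no longer describes the stored data. */
int tag_is_stale(void)
{
    struct figure f;

    f.s = RECTANGLE;
    f.u.rect.side1 = 3.0;
    f.u.rect.side2 = 4.0;

    retag(&f.s);   /* tag now says TRIANGLE; the union still holds a rectangle */

    return f.s == TRIANGLE && f.u.rect.side1 == 3.0 && f.u.rect.side2 == 4.0;
}
```

Standard Pascal closes this hole by forbidding the selector from being passed as a variable parameter at all, which is exactly the check the test shows to be missing.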

6.6.3.3-5

This program deviates from the standard, since an actual variable parameter may not denote a
component of a packed variable.

    type
      card = packed array[1..60] of char;
    var
      image : card;

    function headercard(var col1 : char) : boolean;
    begin
      if col1 = 'B' then
        headercard := true
      else
        headercard := false
    end;

    begin
      image[1] ;
      if headercard(image[1]) then
        writeln(' VAR PARAMETER PASSING(1)')
      else
        writeln(' VAR PARAMETER PASSING(2)')
    end.

The Mips Pascal compiler considers packed and unpacked arrays and records to be equivalent;
thus, for the Mips, this deviation from the standard is of little consequence. However, for
portability's sake, this feature should be changed.
6.6.3.6-10

The Mips Pascal compiler does not adhere to standard parameter list congruity rules:

    var
      aa, bb : integer;

    procedure p(procedure formal(var a,b : integer));
    begin
      formal(aa,bb)
    end;

    procedure actual(var a : integer; var b : integer);
    begin
      writeln(' DEVIATES')
    end;

    begin
      p(actual)
    end.

This example merely points out a simple extension to Pascal (and thus, a small portability
problem), since the declaration parts of formal and actual are essentially identical.

6.7.1-10

Although the compiler should generate errors for each of the three string assignments, it
generates errors only for the last two:

    var
      string1 : packed array[1..4] of char;
      string2 : packed array[1..6] of char;
    begin
      string1 := 'AB';
      string2 := string1;
      string1 := 'ABCDEFG';
    end.

The Pascal Standard states that string types are compatible only if they have the same number
of components. The Mips Pascal compiler is allowing assignment of one string type to
another, padding out with spaces if they are not of the same length, when the source of the
string assignment is a string constant. While this is felt to be a reasonable action, it may pose
portability problems.

6.7.2.5-6

The Mips Pascal compiler allows assignments and comparisons on records and arrays:

    var c, d : record
          f1 : integer;
          f2 : real
        end;
    begin
      c.f1 := 0;
      c.f2 := 3.1;
      if (c <> d) then
        c := d;
    end.

This is a rather nice extension to the Pascal Standard, which, unfortunately, will cause some
big headaches in porting. The comparisons are implemented on a component-by-component
basis, as are the assignments (i.e., they are done correctly). However, although this shortcut
is nice to have, it will prove annoying to anyone porting a program originally written under the
Mips Pascal compiler.
6.8.1-1
6.8.1-2

The Mips Pascal compiler allows gotos between alternative arms of a conditional statement
and case statements:

    i:=5;
    if (i<10) then
      goto 1
    else
      1: write(' DEVIATES...6.8.1-1');
    if (i>10) then
      2: writeln(' GOTO ALTERNATE BRANCH OF IF')
    else
      goto 2

A conditional (or case) statement is considered a compound statement by the standard. A
goto may only reference a simple statement and may not reference a part of a compound.
One of the reasons for this restriction is to prohibit code that skips over loop initialization code
(see 6.8.1-4 below) or block initialization code (see 6.8.1-7 in section 8.3). In general, the Mips
Pascal compiler is implementing the semantics of C in allowing this feature. Programs that
utilize this feature will be unportable or may produce unpredictable results on other compilers.

6.8.1-4
6.8.1-5

The Mips Pascal compiler allows a goto in the middle of a for loop:

    j := 0;
    for i := 1 to 0 do
    begin
      100:
      writeln('OOPS')
    end;
    i := 0;
    if j = 0 then
      goto 100

This feature is just asking for trouble in that it allows the initialization code of a loop to be
skipped. A "clever" programmer could use this feature to advantage but would be violating the
Pascal standard. It is interesting to note that, if the goto is coded as in the example, the string
"OOPS" is printed. If, however, the goto is coded as a non-local goto, no message is printed.
We feel that this particular feature is a dangerous one to include in a production language -
especially when it is disallowed by the Pascal Standard. In general, the Mips Pascal compiler
is implementing the semantics of C in allowing this feature. Programs that use this feature will
be unportable or may produce unpredictable results on other compilers.
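
For comparison, the corresponding construct in C can be sketched as follows. This is our own illustration under stated assumptions: the function name and counter are hypothetical, the counter stands in for the writeln, and a guard is added so the jump happens only once. C accepts a goto into the body of a for loop, so the body executes even though the loop's own bounds would give zero iterations.

```c
/* Counts how many times the loop body runs; all names are illustrative. */
int body_runs(void)
{
    int i, j = 0, runs = 0;

    for (i = 1; i <= 0; i++) {   /* zero iterations when entered normally */
inside:
        runs++;                  /* stands in for writeln('OOPS') */
    }

    if (j == 0) {
        j = 1;                   /* guard: jump into the loop only once */
        i = 0;
        goto inside;             /* enters the body, skipping "i = 1" */
    }

    return runs;                 /* the body ran once, via the goto */
}
```

After the jump, the loop's increment and test still execute, so the control variable's value on exit depends on where the jump came from, which is the kind of hidden state the Pascal restriction is meant to rule out.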

6.8.3.5-7

Subrange lists are allowed in case elements:

    case too of
      1..4 : writeln('low');
      5 : writeln('high')
    end;

According to the Standard, only lists of case elements (i.e., 1,2,3,4) are allowed in case
elements, and not subranges (i.e., 1..4). This is a simple extension to the language and
should not present too much of a portability problem. Difficulties will arise when a range that
includes elements of a set is used, since it is not as obvious a list as integers.
6.8.3.9-6
6.8.3.9-7
6.8.3.9-8
6.8.3.9-9
6.8.3.9-10
6.8.3.9-15
6.8.3.9-16
6.8.3.9-21
6.8.3.9-22
6.8.3.9-24

The Mips Pascal compiler allows a loop control variable to be passed as a var parameter:

    var
      i : integer;

    procedure verynasty(var n : integer);
    begin
    end;

    begin
      for i:=1 to 10 do
      begin
        verynasty(i)
      end;
      writeln('OOPS')
    end.

In this example, the procedure verynasty can change the value of the loop control variable.
This threat is prohibited by the Standard, and by allowing it, the Mips Pascal compiler
introduces a nasty portability (and debugging) feature. Other threats that the compiler allows to
pass through undetected are:

•  Using a non-local variable as a loop control variable.

•  Using a global variable as a loop control variable.

•  Using a local variable for loop control, but permitting its use in another local
   procedure.

•  Modifying the loop control variable with a read statement.

•  Using an actual-value parameter as a loop control variable.

•  Using the value of the loop control variable after loop execution has completed.

•  Allowing the value of the loop control variable to extend past the legal subrange of
   the variable.
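
The first threat, passing the loop control variable by reference, is easy to demonstrate in C, where it is perfectly legal. The sketch below is ours and purely illustrative (the function names and the doubling behavior are assumptions, not taken from the validation suite): a callee that modifies the control variable changes how many times the loop runs.

```c
/* Hypothetical callee: receives the loop control variable by reference
 * and doubles it. */
void meddle(int *n)
{
    *n *= 2;
}

/* A loop written to run ten times actually runs three:
 * i takes the values 1, 3, 7 at the top of the body. */
int iterations(void)
{
    int i, count = 0;

    for (i = 1; i <= 10; i++) {
        meddle(&i);      /* 1 -> 2, 3 -> 6, 7 -> 14 */
        count++;
    }
    return count;
}
```

A reader of the loop header alone would expect ten iterations; this is why the Standard treats any such access as a threat to the control variable and requires it to be rejected.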


As in other cases, this implementation follows the unconstrained semantics seen in C, and
should be changed.

6.9.1-11
6.9.3.1-6
6.9.3.6-3

The Mips Pascal compiler allows values of type other than integer, real, and character to be
read and written from/to a text file:

    var
      one : boolean;
      f1 : text;
    begin
      rewrite(f1);
      one := true;
      writeln(f1,one);
      reset(f1);
      read(f1,one);
    end.

Although this is a clear deviation from the standard, other than creating a portability problem,
we feel that this extension is a valid one. Since the Mips Pascal compiler considers packed
and unpacked arrays of characters to be equivalent, it also allows reads/writes of packed
arrays. This, too, is a valid extension.
6.10-1
6.10-7

In the program specification, declaring output is not required. Also, a file may be a program
parameter but not be declared.

In the former case, the Mips Pascal compiler is adhering to the standard but deviating from
Jensen and Wirth. In the latter, the type of the variable may be inferred. In both cases, we feel
this is of small consequence.


8.2.  Conformance 


This  section  lists  those  features  under  which  the  Mips  Pascal  compiler  deviates  from  the  standard  in 
a  way  that  may  affect  program  compilation.  Generally,  these  deviations  are  expressed  as  failures  of 
the  language  to  meet  certain  minimum  requirements. 


Section 

Symptom  and  Comments 

6.1.5-2

A program with a very large floating-point number (i.e., an integer part with 3 digits followed by
a 35 digit fraction) causes the compiler to issue a fatal error in ugen.

const
    reel = 123.456789012345678901234567890123456789;

The  compiler  should  allow  an  arbitrary  length  floating-point  number  to  be  expressed  in  Pascal. 
Whether  this  value  can  be  accurately  represented  in  an  internal  form  is  irrelevant  -  the 
compiler  must  accept  the  number  as  input. 

6.2.3.5-1
6.4.3.3-11

The Mips Pascal compiler does not detect the use of an uninitialized variable:

procedure q;
var
    i, j : integer;
begin
    i := 2;
    j := 3
end;

procedure r;
var
    i, j : integer;
begin
    j := 4;
    writeln(' THE VALUE OF I IS ', i)
end;

begin
    r
end.

The value printed out for i is 0, which happens to be the value that was in the register
allocated for i when the program was compiled. The same kind of unpredictable behavior
occurs when an uninitialized portion of a variant record is used. The compiler should report the
use of a variable before it is initialized (as is done with lint for the C compiler). Instead, no
indication is given. We feel that this is a shortcoming of the compiler.
6.4.2.4-5

Using strings in a subrange declaration crashes the compiler with the error "Fatal error" and no
line number indication.

    firstindex = 'AB' .. 'CD';

While this program fragment is illegal, the ungraceful error handling of the compiler is
unacceptable. At least a specific error message should be printed. However, the compiler simply
dumps core and terminates execution.


6.4.3.3-10 


The Mips Pascal compiler does not generate an error when accessing a field of an inactive
variant:

type
    two = (a, b);
var
    variant : record
        case tagfield : two of
            a : (m : integer);
            b : (n : integer)
    end;
    i : integer;
begin
    variant.tagfield := a;
    variant.m := 1;
    i := variant.n;  { illegal }
end.


This deviation is another of little consequence. The requirement that all of the values of the
tag-type match the access type is primarily for completeness. The actual value of the tag (in
this case a) is not used to access the variant part, so accessing n while the variant part is set
to a should not cause any problems (even though, technically, an error message should be
printed).

6.4.4-4

This program, which tests that the domain type of a pointer type may be a file type, generates
a segmentation fault:


type
    fileptr = ^text;
var
    ptr1, ptr2, ptr4 : fileptr;

procedure copyandadd(var fromfile, tofile : text; ch : char);
begin
    while not eoln(fromfile) do
    begin
        write(tofile, fromfile^);
        get(fromfile)
    end;
    write(tofile, ch);
    reset(fromfile); reset(tofile)
end;

procedure swapptr(var first, second : fileptr);
var
    helpptr : fileptr;
begin
    helpptr := first; first := second; second := helpptr
end;

procedure checkcontents(thefile : fileptr; expectedvalue : integer);
var
    actualvalue : integer;
begin
    readln(thefile^, actualvalue);
end;

begin
    new(ptr1); new(ptr2); new(ptr4);
    rewrite(ptr1^); rewrite(ptr2^); rewrite(ptr4^);
    write(ptr1^, '1');
    reset(ptr1^);
    copyandadd(ptr1^, ptr2^, '4');
    swapptr(ptr2, ptr4);
    checkcontents(ptr4, 1);
end.


This example fails due to some internal consistency error in the run-time library. Whatever the
cause, the Pascal run-time should never dump core, but should issue some reasonable run-time
error message.

6.4.5-15
6.4.6-9
6.4.6-10
6.4.6-12
6.7.2.4-4

The Mips Pascal compiler does not always detect out-of-range errors correctly, even when the
-C switch is used:


type
    subrange = 0..5;
var
    i : subrange;

procedure test(a : subrange);
begin
    writeln(' THE VALUE OF A IS ', a);
end;

begin
    i := 5;
    test(i*2);  { error }
end.


In this specific example, the compiler is able to track the value of i into the procedure test
when the optimizer is enabled and when range checking is enabled. If, however, the optimizer
is not used, or if range checking is not explicitly enabled, no error message is issued. While
the latter is an acceptable constraint, we do not feel that the presence of the optimizer should
influence range checking. In this example, and many others, range checking was only
performed at compile time, not at run-time. In addition to parameter passing, range checking also
fails with:

•  simple  variable  assignments 

•  array  indexing 

•  incompatible  (non-overlapping)  set  assignments 

•  sets  passed  as  parameters 

This is very bad behavior for a Pascal compiler to exhibit, especially since Pascal is supposed
to be a strongly typed, range checking language. Note again that these errors occurred even
when range checking was enabled during compilation.
6.5.5-2
6.5.5-3

The run-time error in this program is not detected:

var
    fyle : text;

procedure naughty(var f : char);
begin
    if f = 'G' then
        put(fyle)
end;

begin
    rewrite(fyle);
    fyle^ := 'G';
    naughty(fyle^);
end.

This  program  causes  an  error  by  changing  the  current  file  position  of  a  file,  while  the  buffer- 
variable  is  an  actual  variable  parameter  to  a  procedure.  The  error  should  be  detected  by  the 
run-time. 

6.6.3.1-9

The following program fragment does not compile:

type
    t = 0..10;

function f(t : integer) : t;

The error reported for t (the second instance) is "Identifier is not of appropriate class".

The  problem  is  that  the  compiler  is  not  keeping  type  declarations  and  variable  declarations  in 
different  name  spaces.  The  declaration  of  a  local  variable  t  correctly  overrides  all  other 
enclosing  declarations.  However,  the  declaration  also  obscures  the  declaration  of  the  type  t, 
which  is  incorrect. 

6.6.3.2-3

The Mips Pascal compiler passes all arrays by reference, regardless of the presence of a var
qualifier.

This is bad news for portability. It is acceptable for a Pascal compiler to pass a non-var array
by reference, provided it is treated as a read-only array in the called routine. However, the
Mips Pascal compiler does not even do this check, and simply passes the address of the array
into the routine, allowing full access to the array body. Truly, it is very inefficient to copy the
entire contents of an actual array parameter into a formal array parameter, but if that is the
action desired by the programmer (and demanded by the Standard), then the compiler must
perform this action.

6.6.3.5-2

The Mips Pascal compiler does not check for function return-type congruity:
type
    natural = 0..maxint;
var
    k : integer;

function actual(i : natural) : natural;
begin
    actual := i
end;

procedure p(function formal(i : natural) : integer);
begin
    k := formal(10)
end;

begin
    p(actual);
end.

The return type of the function formal does not match that of the function actual. This is a
severe portability problem because the compiler does not check for an incompatibility that
other compilers will surely complain about. In addition, it violates the strongly typed nature of
Pascal.

The  Mips  Pascal  compiler  does  not  check  for  parameter  list  congruity,  whether  the  parameters 
are  of  type  var  or  not: 

program failure(output);
type
    natural = 0..maxint;

procedure actual(i : integer; n : natural);
begin
    i := n
end;

procedure p(procedure formal(a : integer; b : integer));
var
    k, l : integer;
begin
    k := 1; l := 2;
    formal(k, l)
end;

begin
    p(actual);
end.

The  parameter  types  of  the  procedure  formal  do  not  match  those  of  the  procedure  actual. 
This  is  a  severe  portability  problem.  In  addition,  it  violates  the  strongly  typed  nature  of  Pascal. 

6.6.5.2-19

Calling the built-in function get with no parameters causes a fatal error in /usr/lib/upas.

This shortcoming in the Mips Pascal compiler is indicative of the rather sparse error recovery
system built into the compiler. Section 8.5 discusses this shortcoming in more detail.

The  Mips  Pascal  compiler  fails  to  detect  when  a  function  assignment  is  not  executed: 

function area(a : real) : real;
var
    x : real;
begin
    if a > 0 then x := 3.1415926*a*a
    else area := 0
end;

begin
    writeln(area(2.0));
end.

The Pascal Standard states that the result of a function will be the last value assigned to its
identifier. If no assignment occurs, then the result is undefined. The Mips Pascal compiler is in
error by not detecting this fact.
6.6.5.2-5
6.6.5.2-6
6.6.5.2-7
6.6.5.2-9
6.6.5.2-10
6.6.5.2-12
6.6.5.2-13
6.6.5.2-14
6.6.5.2-15
6.6.6.5-6
6.6.6.5-7
6.6.6.5-8

This test fails to cause an error by applying 'reset' to an undefined file:

var
    f : file of integer;
begin
    reset(f);
end.


This  is  another  example  of  the  Mips  Pascal  compiler  allowing  uninitialized  variables  to  be  used 
in  expressions.  Other  errors  involving  files  include: 

•  Allowing  a  get  following  a  rewrite. 


•  Allowing  a  read  of  a  type  incompatible  with  the  file  type. 


•  Allowing  a  write  of  a  type  incompatible  with  the  file  type. 

•  Allowing  a  get  of  a  type  incompatible  with  the  file  type. 

•  Allowing  a  put  of  a  type  incompatible  with  the  file  type. 

•  Allowing  a  get  past  the  end  of  a  file. 

•  Allowing  a  put  to  an  undefined  buffer  variable. 

•  Allowing  an  eof  to  an  undefined  file  variable. 

•  Allowing  an  eoln  while  eof  is  true. 

•  Allowing  an  eoln  to  an  undefined  file  variable. 

The only error that is detected correctly is:


•  A  put  on  a  file  not  open  for  writing. 


6.6.5.3-6
6.6.5.3-7
6.6.5.3-8
6.6.5.3-9
6.6.5.3-10
6.6.5.3-11
6.6.5.3-13
6.6.5.3-14
6.6.5.3-16
6.6.5.3-17
6.6.5.3-21

The following example fails to detect the use of a pointer after it has been disposed:

type
    pointer = ^integer;
var
    p : pointer;
begin
    new(p);
    p^ := 10;
    dispose(p);
    writeln(p^);
end.

The Mips Pascal compiler and run-time perform no checks on the validity of pointers.
Undetected errors include:

•  Allowing  a  dispose  on  a  pointer  whose  value  is  currently  active  as  a  var 
parameter. 

•  Allowing  a  dispose  on  a  pointer  which  is  currently  being  referenced  by  a  with 
statement. 

•  Allowing the use of a pointer after it has been disposed.

•  Allowing  the  use  of  a  pointer  that,  through  assignment,  was  equal  to  another 
pointer  that  has  been  disposed. 

•  Allowing a generic dispose on a pointer referencing a variant record, or passing
different or the wrong number of parameters to the long form of dispose (this is
merely a portability problem, since the Mips Pascal compiler uses the generic Unix
memory allocation mechanism).

•  Allowing a reference (either left or right-hand side, or parameter) to the pointer p^
when p^ refers to a variant record (i.e., a reference other than to a component of
the record). This results in a potentially illegal copying of differing variant record
components.

•  Allowing the activation of a variant part other than that created by a call to
new(p, c1, c2, ...).


6.6.5.4-2
6.6.5.4-3
6.6.5.4-4
6.6.5.4-5


All  of  these  failings  of  the  Mips  Pascal  compiler  are  dangerous  ones.  The  first  four  are  classic 
problems of the C run-time library that should be fixed in a type and range checking language
such  as  Pascal.  The  last  failing  presents  a  severe  problem,  since  only  the  minimum  space  is 
allocated  in  the  call  to  new,  and  activating  a  different  variant  part  may  write  to  other,  unrelated 
areas  of  memory.  All  of  these  errors  should  be  fixed. 

The  Mips  Pascal  compiler  and  run-time  fail  to  detect  that  the  ordinal  type  parameter  to  the 
built-in  procedure  pack  is  not  assignment  compatible  with  the  index  type  of  the  unpacked 
array  parameter: 


6.6.5.4-6
6.6.5.4-7

type
    pak = packed array [0..15] of boolean;
var
    a : array [1..16] of boolean;
    s : pak;
    i : 1..16;
begin
    for i := 1 to 16 do
        a[i] := true;
    pack(a, 0, s);
end.


The  Mips  Pascal  compiler  is  not  performing  the  following  checks  on  arrays: 

•  Not detecting that the ordinal type parameter to the built-in procedure pack (or
unpack) is not assignment compatible with the index type of the unpacked
(packed) array parameter.

•  Allowing  pack  (unpack)  to  be  called  on  an  array  that  contains  undefined  ele¬ 
ments. 


•  Allowing  the  index  of  the  unpacked  (packed)  array  to  be  exceeded  in  a  call  to 
pack  (unpack). 


The last case is especially nasty, since it implies that the array bounds can be exceeded,
writing to an area of memory that may contain other, unrelated information. Since these errors
are not detected, spurious program behavior can result. The second error is very difficult and
expensive to detect, but the other two errors should be corrected.

6.6.6.4-9

The  Mips  Pascal  compiler  allows  the  ord  function  to  be  applied  to  a  pointer. 


var
    ptr : ^integer;
    i : integer;
begin
    new(ptr);
    i := ord(ptr);
end.

Again, the Mips Pascal compiler is generally fairly poor at checking for assignment
compatibility. This is another example of the failure of the compiler to adhere to the Pascal typing
rules.
6.6.6.2-4
6.6.6.2-5
6.6.6.2-12
6.6.6.2-13
6.6.6.2-14
6.6.6.3-3
6.6.6.3-4
6.6.6.4-5
6.6.6.4-6
6.6.6.4-7

6.7.2.2-8
6.7.2.2-9
6.7.2.2-10
6.7.2.2-11
6.7.2.2-12
6.7.2.2-13
6.7.2.2-16
6.7.2.2-19

The Mips Pascal compiler does not check that the parameters to arithmetic functions are of the
correct type:

var
    a : real;
begin
    a := sqr('4');
end.


The Mips Pascal compiler is generally fairly poor at checking for assignment and range
compatibility. This is one example of the failure of the compiler to adhere to the Pascal range and
typing rules. Other failures include:

•  Allowing a negative number to be passed to the ln function.

•  Allowing a negative number to be passed to the sqrt function.

•  Allowing an undetected (integer) overflow of the sqr function.

•  Allowing a number larger than maxint to be passed to trunc or round.

•  Allowing the succ function on the last value of an ordinal type.

•  Allowing the pred function on the first value of an ordinal type.

•  Allowing the chr function to be used on ordinal types exceeding the range of
characters.

In  all  of  these  cases,  no  compile  time  or  run-time  error  is  issued.  The  purpose  of  the  range 
and  type  checking  inherent  in  most  Pascal  compilers  is  to  detect  these  types  of  programming 
errors.  By  failing  to  detect  these  errors,  the  Mips  Pascal  compiler  is  allowing  many  potential 
bugs  to  creep  into  programs.  It  should  be  noted  that  these  errors  pass  through  even  when  the 
-C  switch  is  used  to  enable  run-time  range  checking. 

The Mips Pascal compiler does not issue a run-time error when a value larger than maxint is
printed:

var
    i : integer;

function maxie : integer;
begin
    maxie := maxint;
end;

begin
    i := 100;
    writeln(' MAXINT + 100 = ', maxie+i);
end.


In those cases in which the condition can be detected at compile time, the Mips Pascal
compiler will report an arithmetic overflow. There appears to be no run-time range checking on
any arithmetic operations, including:

•  Allowing a negative second operand in the mod operation.

•  Allowing a floating-point divide by zero.

•  Allowing  run-time  overflow  on  addition. 

The  run-time  will  report  on  an  integer  division  or  modulo  by  zero,  but  it  does  so  by  issuing  a 
break  point  trap  and  dumping  core.  This  is  unacceptable. 
6.7.2.2-18


The  Mips  Pascal  compiler  allows  operands  of  other  than  real  or  integer  to  be  used  in  a  division 
operation: 


var
    c : char;
    r, m : real;

    c := 'A';
    m := 1.5;
    r := c/m;

Since Pascal is a strongly typed language, the Mips Pascal compiler should check for such
blatant violations of type compatibility. Instead, it is following the semantics of C, and
considering a character type to be the same as an integer type. This is clearly an error.

6.7.2.4-9
6.7.2.5-10

The Mips Pascal compiler allows a non-ordinal type (i.e., strings or sets) to be the left operand
of the in operator:


var
    s : set of 0..10;
begin
    s := [3];
    if (s in []) or ('HZ' in []) then
        writeln('OOPS');
end.


6.7.2.5-7


This  is  another  example  of  the  Mips  Pascal  compiler  having  trouble  with  type  checking  and 
with  set  operations.  A  lot  of  work  needs  to  be  done  with  both  of  these  to  bring  the  compiler  up 
to  a  workable  level. 

The Mips Pascal compiler allows equality and non-equality between different pointer-types:

type
    natural = 0..10;
    one = ^integer;
    two = ^natural;
var
    x : one;
    y : two;
begin
    new(x);
    x^ := 2;
    new(y);
    y^ := 3;
    if (x <> y) or not (x = y) then
        writeln('YOW');
end.


Since the ranges of integers expressed by type integer and type natural are different,
comparisons across these pointer types should be illegal. However, the Mips Pascal compiler
allows them, which introduces a serious portability problem and demonstrates a dangerous lack
of type checking.
6.9.3.1-2
6.9.3.1-3
6.9.3.1-7

This program deviates from the Standard because it allows output of a non-positive field width:

var
    f : text;
    i : integer;
begin
    rewrite(f);
    for i := 10 downto -1 do
        write(f, 'REP=':i);
end.

The Mips Pascal compiler allows this illegal program, as well as a program which prints a
floating point number with a zero field width fraction, to compile and run. While this is a small
problem on the Mips (the program will at least print out something), it presents a large
portability problem.

6.10-8

This program deviates from the Standard because the program-parameter f has been
subsequently declared as a function.

program t6p10d8(f, output);

function f : boolean;
begin
    f := true
end;

begin
    writeln('OOPS')
end.

In this case, the type of f is initially inferred from the program definition. However, it is later
defined as a function. When it is referenced, what type is it? In this case, it will be a function,
which indicates a lack of the appropriate type checking.


8.3.  Bad  Code 

This  section  describes  samples  of  incorrect  code  being  generated  by  the  compiler  from  legal  Pascal 
source  code.  These  examples  are  the  nightmare  of  every  programmer  -  debugging  them  is  very 
difficult  because  as  far  as  the  programmer  can  tell,  the  source  code  is  perfectly  reasonable,  although 
the  output  of  the  compiler  does  not  exactly  correspond  to  the  input. 

These deviations represent serious problems with the compiler. In fact, there may be more
examples than the ones shown here. These were found only because of specific checks
put in the test programs to look for such errors, or because the compiler exhibits different behavior
with and without the optimizer engaged. In the past, we have been able to generate similar errors by
writing intentionally noxious code, or by misusing Pascal. The primary problem lies in the fact that
compilers are all too often tested only on good code, and not on incorrect code.
6.1.1-3

The following code fragment (where stv is a set of [0..9]) causes an infinite loop at
optimization level 2 or above:

    stv := [1];
    repeat
        with pkr do;
    until (1) in stv;


The  compiler  generates  the  following  code: 

#  140  stv := [1];
        li      $14, 1073741824
        sw      $14, -16($6)
#  141  repeat
$62:
#  142  with pkr do;
#  143  until (1) in stv;
        b       $62

This  plainly  loops  forever.  The  reason  the  compiler  generates  this  code  is  not  obvious, 
although  examining  the  code  generated  at  optimization  level  1  gives  us  a  clue: 

#  140  stv := [1];
        li      $9, 1073741824
        lw      $10, 36($sp)
        sw      $9, -16($10)
#  141  repeat
$64:
#  142  with pkr do;
#  143  until (1) in stv;
        lw      $11, 36($sp)
        lw      $12, -16($11)
        sll     $13, $12, 1
        bge     $13, 0, $64

At this lower level of optimization, the compiler is performing the set-inclusion test.
Unfortunately, the test is generated incorrectly. Rather than shifting a single bit to the left (and then
comparing the result with the set), the compiler instead is shifting the set left and comparing it
with zero. The compiler can determine this as a compile-time constant (at the higher
optimization level), and it generates an infinite loop.
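To make the constant in these listings easier to read, the following Python sketch packs a small set into one 32-bit word and performs the mask-based membership test described above. It assumes that element 0 occupies the most significant bit of the word; that layout is inferred from the constant 1073741824 loaded for [1] and is not taken from Mips documentation. The helper names are hypothetical.

```python
WORD_BITS = 32  # assumed width of the word holding a small set


def set_to_word(elements):
    """Pack a set of small integers into one word, element 0 in the top bit."""
    word = 0
    for e in elements:
        if not 0 <= e < WORD_BITS:
            raise ValueError(f"element {e} does not fit in one word")
        word |= 1 << (WORD_BITS - 1 - e)
    return word


def contains(word, element):
    """Membership test: shift a single bit into place and AND it with the set."""
    mask = 1 << (WORD_BITS - 1 - element)
    return (word & mask) != 0


stv = set_to_word([1])
print(stv)               # 1073741824, the constant loaded by "li" in the listings
print(contains(stv, 1))  # True, so "until (1) in stv" should end the loop at once
```

Under this layout the loop condition is true on the first pass, which is why a correct translation must fall out of the repeat statement immediately.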

6.5.4-1
6.5.4-2

The Mips Pascal compiler allows a pointer which is undefined, or explicitly initialized to nil, to
be dereferenced, creating a core dump:

type
    rekord = record
        a : integer;
        b : boolean
    end;
var
    pointer : ^rekord;
begin
    pointer := nil;
    pointer^.a := 1;
end.

Even with value tracking, the compiler is unable to detect this blatant error. In the more subtle
case where the value of pointer is left uninitialized, the compiler exhibits similar behavior.
This is another manifestation of the lack of run-time checking by the compiler, and it should be
corrected. At the very least, the run-time should print out a Pascal run-time error message
before performing the core dump.
6.6.5.2-8
6.6.5.2-11

The following program dumps core on execution:

var
    f : file of char;
begin
    get(f);
end.

The reason the program dumps core is that the file f is undefined when the get is performed
(i.e., no reset was executed). The run-time library should have detected this fact at run-time.
Instead, it rather ungracefully terminated execution. At the very least, a run-time error
message should have been issued. The program will also dump core if page is substituted for
get.

6.5.3-4
6.5.3-5

The following program dumps core on execution:


type
    rekord = record
        a : integer;
        b : boolean
    end;
var
    ptr : ^rekord;
begin
    ptr := nil;
    dispose(ptr);
end.


The reason the core dump occurs is that ptr is nil. The run-time should test for illegal
values of pointers before executing the dispose operation. This program will also dump core if
ptr is left undefined (instead of being explicitly set to nil). In the latter case, if the variable
containing the pointer is uninitialized, but contains (through happenstance) the value of a
different pointer, a different dynamic element could be disposed of - a highly undesirable effect.
These shortcomings should be corrected so that the run-time issues an error, rather
than having the program dump core.

6.7.1-6

The following example works correctly without the optimizer engaged but fails when
optimization level 2 is used:

    n := 2;
    if [1,2,succ(n)] = [1..3] then
        c := c+1;

The reasons for failure result from compile-time value tracking and elimination of redundant
code. Specifically, the optimizer knows the values of all of the conditional expressions at
compile-time and simply increments c for each case where the conditional is true (eliminating
the test code in the process). Unfortunately, the optimizer fails to track and recognize the
expression [1,2,succ(n)] = [1..3] as being true. Examining the assembly output for this
fragment, we see that this is another manifestation of the bad code generated for sets:
#  26   if [1,2,succ(n)] = [1..3] then
        lw      $13, 36($sp)
        addu    $14, $13, 1
        addu    $15, $14, -96
        sltu    $24, $15, 32
        not     $25, $14
        sll     $8, $24, $25
        addu    $9, $14, -64
        sltu    $10, $9, 32
        sll     $11, $10, $25
        or      $12, $8, $11
        addu    $13, $14, -32
        sltu    $15, $13, 32
        sll     $24, $15, $25
        or      $9, $12, $24
        sltu    $10, $14, 32
        sll     $8, $10, $25
        or      $11, $8, 1610612736
        xor     $13, $11, 1879048192
        or      $15, $9, $13
        bne     $15, 0, $32
        .loc    2 27
#  27   c := c+1;
        lw      $12, 32($sp)
        addu    $24, $12, 1
        sw      $24, 32($sp)
$32:

The two constants 1610612736 and 1879048192 are 0x60000000 and 0x70000000,
respectively (which correspond to the sets [1..2] and [1..3], respectively). The optimizer is performing
a correct optimization, given an incorrect source of assembly instructions.
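The correspondence between these constants and the two sets can be checked with a short Python sketch. As before, it assumes that element 0 is stored in the most significant bit of a 32-bit word, a layout inferred from the constants themselves; word_to_set is a hypothetical helper, not a Mips routine.

```python
WORD_BITS = 32  # assumed width of the word holding a small set


def word_to_set(word):
    """Decode a set word into its elements, assuming element 0 is the top bit."""
    return {e for e in range(WORD_BITS) if word & (1 << (WORD_BITS - 1 - e))}


lhs = 1610612736   # OR-ed in for the constant part of [1,2,succ(n)]
rhs = 1879048192   # XOR-ed in for [1..3]

print(hex(lhs), sorted(word_to_set(lhs)))   # 0x60000000 [1, 2]
print(hex(rhs), sorted(word_to_set(rhs)))   # 0x70000000 [1, 2, 3]

# Equal sets XOR to zero, so adding element 3 (succ(2)) to lhs should match rhs.
with_succ = lhs | (1 << (WORD_BITS - 1 - 3))
print(with_succ ^ rhs)                      # 0
```

With n = 2 the XOR result is zero, so a correct translation of the comparison would take the branch that increments c.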

6.7.2.2-4

The compiler issues the error "uopt: Warning: multiplication overflow" on the following example
when the optimizer is enabled but issues no error if it is disabled. In either case, no run-time
error is issued.

    max := -(-maxint);
    if odd(maxint) then
        i := (max-((max div 2)+1))*2

The problem here is that the optimizer is reorganizing the arithmetic expression (while no
such reorganization is performed without the optimizer). This rearrangement causes the
arithmetic overflow. Since the expression was parenthesized specifically to avoid the
mathematical overflow, we believe that the compiler is in error.
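The arithmetic can be worked through in Python, which has unbounded integers and therefore lets each intermediate value be checked against a 32-bit range explicitly. The sketch assumes maxint = 2^31 - 1 and takes max equal to maxint. The function distributed shows one hypothetical rearrangement that overflows; we do not know which rearrangement uopt actually performs.

```python
MAXINT = 2**31 - 1   # assumed 32-bit integer range


def checked(value):
    """Return value if it fits in a signed 32-bit integer, else raise OverflowError."""
    if not (-MAXINT - 1 <= value <= MAXINT):
        raise OverflowError(f"{value} does not fit in 32 bits")
    return value


def as_written(mx):
    # i := (max - ((max div 2) + 1)) * 2, evaluated in source order
    half = checked(checked(mx // 2) + 1)
    diff = checked(mx - half)
    return checked(diff * 2)


def distributed(mx):
    # Hypothetical rearrangement: max*2 - ((max div 2) + 1)*2
    left = checked(mx * 2)          # overflows for max = maxint
    right = checked(checked(checked(mx // 2) + 1) * 2)
    return checked(left - right)


print(as_written(MAXINT))           # 2147483646
try:
    distributed(MAXINT)
except OverflowError as err:
    print("overflow:", err)
```

Evaluated as parenthesized, every intermediate value stays within range and the result is maxint - 1; only a reordering that multiplies before subtracting leaves the range.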

6.7.2.5-2

This program fragment does not print TRUE as it should:

    b := [2,3,4];
    c := 3;
    if (c in b) then
        writeln('TRUE');

This is another example of the in operator generating bad code and having it optimized out to
nothingness. When expressions such as b=c or b<>c are used, the compiler sometimes
also functions incorrectly. The in operator (as well as the equality operator from Example
6.7.1-6) seems to be failing.
6.6.1-6
6.8.1-7

The Mips Pascal compiler allows a goto into a with statement, with disastrous results. The
following program dumps core:

type
    rec = record
        y : integer;
    end;
    ptrec = ^rec;
var
    x : ptrec;
    done : boolean;
begin
    new(x);
    x^.y := 100;
    done := false;
    if done then
        with x^ do
1:
            begin
                writeln(y);
                y := y + 1
            end;
    if not done then
    begin
        done := true;
        goto 1
    end
end.


In this example, the placement of the label is legal, in that it references a simple statement.
The goto is illegal, however, in that it references an illegal target. In this case, the initialization
code for the with statement is skipped, and an indirection through an uninitialized register is
performed in accessing x^.y. This "feature" should be removed from the Mips Pascal
compiler, and only the legal set of gotos should be allowed. In general, the Mips Pascal compiler
is implementing the semantics of C in allowing this feature.

6.10-10

The following program causes a core dump:


var
    c : char;
begin
    writeln('Start');
    reset(output);
    read(output, c);
end.


This program attempts to reuse output as a regular file that can be read from. This attempt is
perfectly legal according to the Standard because it is implementation-defined as to whether
output actually goes to a terminal (all it need [must] do is treat output as an ordinary file).
The Mips Pascal compiler implementation of output classes it as the Unix stdout file using
the standard Unix file conventions. This breaks the Pascal standard. While few users may
take advantage of this aspect of the Standard, there are other ramifications that must be
considered.
-none-

The following code generates the error from the linker: "Undefined: write_set".

var
    a : set of 0..10;
begin
    a := [1,3,5];
    writeln(a);
end.


The implementors of the Mips Pascal compiler library functions have either not implemented
the write_set operation, or they have failed to include it in the distribution. In any event,
write_set is not in the library file, and programs which attempt to print out the contents of
sets will fail to compile successfully.


8.4.  Mips  Extensions  to  Standard  Pascal 

According  to  Mips,  the  Mips  Pascal  compiler  contains  the  following  extensions  to  the  Pascal  Stan¬ 
dard: 

•  Allows  the  use  of  underscores  ( _ )  in  variable  names. 

•  Prints alphabetic labels (see test 6.1.6-6 in Section 8.1).

•  Allows numbers in a non-decimal radix. Any radix between base 2 and base 36 is
permitted; write and writeln also support arbitrary radix output.

•  Predefines three extra data types in the compiler:

•  double  -  double  precision  floating-point 

•  cardinal  -  unsigned  integers  in  the  range  of  0.. 4294967295 

•  pointer  -  a  pointer  to  any  data  type 

•  The  Mips  Pascal  compiler  always  does  short-circuit  boolean  evaluation  (this  is  a  per¬ 
mitted  extension,  but  dependency  on  it  guarantees  non-portability). 

•  Automatically pads strings with trailing spaces to fill them to the required length (see
test 6.7.1-10 in Section 8.1).

•  Allows  non-ASCII  characters  in  strings,  following  the  Unix  convention  of  escape  charac¬ 
ter  sequences. 

•  Permits  constant  expressions  in  type  or  array-bound  definitions.  It  also  supports  the 
following  additional  built-in  functions: 

•  bitand - bitwise and

•  bitor - bitwise or

•  bitxor - bitwise xor

•  lshift - logical left shift

•  rshift - logical right shift

•  lbound - the lower bound of an array (this is odd in that this facility is provided
but conformant array parameters are not)

•  hbound - the higher bound of an array (this is odd in that this facility is provided
but conformant array parameters are not)

•  first - the first value of a scalar type

•  last - the last value of a scalar type

•  sizeof - the size (in bytes) of a data type

•  min - the minimum of a set of scalars

•  max - the maximum of a set of scalars

•  assert  -  evaluates  a  boolean  expression  and  prints  a  run-time  error  message 

•  date  -  the  current  date  in  string  form 

•  time  -  the  current  time  of  day  in  string  form 

•  clock  -  the  number  of  milliseconds  of  CPU  time  used  by  the  process 

•  argv - returns a specified program argument as passed in from the shell

•  Permits ranges as case statement constants (see test 6.8.3.5-7 in Section 8.1).

•  Includes an otherwise clause in the case statement.

•  Allows a return statement to exit a subroutine or function.

•  Permits a continue and a break statement with semantics similar to the C version.

•  Adds the concept of shared variables and the keyword external to facilitate separate
compilation.

•  Allows variables to have an initialization clause along with their declaration part. This is
especially useful for initializing arrays.

•  Relaxes the declaration ordering rules. See tests 6.2.1-8, 6.2.1-9, and 6.2.1-10 in
the section on portability (Section 8.1).

•  Allows the rewrite and reset routines to take an optional filename parameter.

•  Allows the write and writeln routines to work on enumerated types.

•  Employs a preprocessor (namely cpp) before compilation.
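
The portability hazard noted above for short-circuit boolean evaluation can be illustrated with a
small sketch. Python is used purely for illustration; the function names are hypothetical, and
full_and is only a stand-in for a compiler that evaluates both operands of a boolean operator,
which the Pascal Standard permits. It is not a model of any particular compiler.

```python
def find_short_circuit(items, target):
    """Linear search whose loop guard relies on short-circuit evaluation."""
    i = 0
    # `and` stops at the first false operand, so items[i] is never
    # evaluated once i has run off the end of the list.
    while i < len(items) and items[i] != target:
        i += 1
    return i


def full_and(left, right):
    """Model of a non-short-circuit 'and': both operands are always evaluated."""
    a = left()
    b = right()
    return a and b


def find_full_evaluation(items, target):
    """The same search under full evaluation of both operands."""
    i = 0
    # When the target is absent, items[i] is evaluated with i == len(items)
    # and the search fails with an index error instead of returning.
    while full_and(lambda: i < len(items), lambda: items[i] != target):
        i += 1
    return i
```

A program written against the first form runs correctly under the Mips compiler but may fail on
any conforming compiler that chooses full evaluation, which is the non-portability the list above
warns about.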


8.5. Local Conclusions

In spite of the large number of specific deviations, the Mips Pascal compiler is a fairly reasonable
compiler which generates very efficient code. The robustness of the compiler is, however, questionable
at best. Even with the compiler option -c, which, according to the on-line manual page entry for
the Pascal compiler pc, is supposed to "generate code for run-time range checking," the range and
type checking of the compiler are fairly specious and need to be made much more robust.
It is possible to assign numbers out of their range, to assign one set to another which has no
overlapping objects, to generate (without detection) arithmetic overflow and underflow, to index
through a deleted pointer, to read past the end of a file, and so on. In short, the Mips Pascal
compiler implements the simple Unix and C model of a programming language.

We would not dwell so much on the failings of the Pascal compiler were it not for one simple fact:
the Mips common code generator, optimizer, assembler/reorganizer, and the Mips Pascal compiler
itself are all written in this same version of Pascal. Thus, since the compiler does not check for
pointer validity, range overflow, and file validity, unless the programmer performs these checks
explicitly, it is entirely possible that all manner of bugs will be lurking in the depths of these programs.
Mips Incorporated has repeatedly asserted that this is not true, but we do not agree. The tests that
they have run on their compilers are, by their own admission, a set of programs which are known to
function correctly.* These programs will only detect that the compiler and utilities function correctly
given correct input. They in no way test the compilers' behavior given incorrect, or for that matter,
merely different input. We are willing to give long odds that adding the full complement of range and
bounds-checking code to the Pascal compiler will likely turn up at least one hitherto undetected
violation of range or boundary limits.

There are numerous examples in the validation suite of the compiler or the run-time crashing while
executing suspicious (or in some cases, correct) Pascal source code. While it is unreasonable for
the run-time to crash, it is unacceptable for the compiler to ever crash, no matter how unreasonable
the input. Regrettably, the Mips Pascal compiler could stand a bit of strengthening in this area.
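
To make concrete what run-time range checking is expected to catch, the following is a minimal
sketch of the kind of check a compiler could emit around subrange assignments and integer
arithmetic. It is written in Python for illustration only; RangeError, checked_assign, and
checked_add are hypothetical names and do not describe the Mips run-time library.

```python
class RangeError(Exception):
    """Raised when a value falls outside its declared subrange."""


def checked_assign(value, low, high):
    """Return value if low <= value <= high, otherwise raise RangeError."""
    if not (low <= value <= high):
        raise RangeError(f"value {value} outside {low}..{high}")
    return value


def checked_add(a, b, bits=32):
    """Add two signed integers, raising RangeError on overflow or underflow."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return checked_assign(a + b, low, high)
```

With checks of this kind in place, an out-of-range assignment or an arithmetic overflow is
reported at the point where it occurs instead of silently corrupting later computation.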


*Examples are: the Unix utility set, their own compilers, the run-time libraries, benchmarks, etc.


CMU/SEI-87-TR-29




9.  Unexpected  Program  Behavior 

Figure  9-1  shows  a  simple  assembly  program  that  has  four  load  instructions  from  two  different  ad¬ 
dresses.  Both  of  the  addresses  are  in  the  sdata  psect,  and  thus  all  addresses  are  supposed  to  be 
gp-relative. 

        .sdata
        .align  2
x:      .word   1

        .text
L:
        lw      $2,x
        la      $3,x
        lw      $4,y
        la      $5,y

        .sdata
        .align  2
y:      .word   1

Figure 9-1: Assembly Code that Triggers gp-Relative Bug

The Mips assembler/reorganizer is supposed to take assembly language programs and translate
them into Mips M/500 native instructions, potentially changing some instruction sequences into
others. One of the instruction sequences that it is supposed to modify is the load-class instruction. If
the source of the load is at a gp-relative address, then the assembler/reorganizer should make the
load be gp-relative. If not, then the assembler/reorganizer should make the load be from a 32-bit
address. The advantage to the gp-relative load is that it requires only one instruction, while the
32-bit address load requires two.
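
The rule just described can be sketched as a conventional two-pass arrangement, in which every
label is recorded together with its section before any load is expanded, so that the choice between
the one-instruction and two-instruction forms does not depend on where the label is defined. The
sketch below is in Python and is an assumption-laden illustration, not the Mips
assembler/reorganizer: the function names are hypothetical, the parser handles only the
directives used in Figure 9-1, and "defined in .sdata" is taken as the sole test for a gp-relative
address.

```python
FIGURE_9_1 = """\
        .sdata
        .align  2
x:      .word   1
        .text
L:
        lw      $2,x
        la      $3,x
        lw      $4,y
        la      $5,y
        .sdata
        .align  2
y:      .word   1
"""


def collect_sections(source):
    """Pass 1: map every label to the section it is defined in."""
    section, labels = None, {}
    for line in source.splitlines():
        text = line.strip()
        if ":" in text:
            name, text = text.split(":", 1)
            labels[name.strip()] = section
            text = text.strip()
        if text in (".sdata", ".text"):
            section = text
    return labels


def expand_loads(source):
    """Pass 2: count the native instructions needed for each lw/la."""
    labels = collect_sections(source)  # includes labels defined later in the file
    counts = []
    for line in source.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("lw", "la"):
            register, symbol = parts[1].split(",")
            gp_relative = labels.get(symbol) == ".sdata"
            counts.append((parts[0], symbol, 1 if gp_relative else 2))
    return counts
```

Under these assumptions, expand_loads(FIGURE_9_1) reports one instruction for each of the four
loads, for both x and y.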

In  the  source  code  in  figure  9-1,  all  of  the  address  references  are  properly  gp-relative,  and  each 
should  be  translated  into  a  single  Mips  M/500  instruction.  However,  as  can  be  seen  in  figure  9-2, 
this  is  not  the  case. 


0x0:    8f828010    lw      v0,-32752(gp)
0x4:    27838010    addiu   v1,gp,-32752
0x8:    3c010000    lui     at,0x0
0xc:    8c24001c    lw      a0,28(at)
0x10:   2425001c    addiu   a1,at,28
0x14:   00000000    nop

Figure 9-2: Mips M/500 Code from Figure 9-1
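
The instruction words in Figure 9-2 can be checked by decoding their fields by hand. The sketch
below does this in Python for the immediate-format instructions that appear in the figure. The
field layout (6-bit opcode, two 5-bit register fields, 16-bit immediate) and the three opcode
values are those of the standard MIPS I encoding; the register names follow Table A-1. The
function name decode_i_type is hypothetical, and only the opcodes needed for this figure are
included.

```python
REGISTER_NAMES = [
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "fp", "ra",
]

# Only the immediate-format opcodes that occur in Figure 9-2.
I_TYPE_OPCODES = {0x09: "addiu", 0x0F: "lui", 0x23: "lw"}


def decode_i_type(word):
    """Return (mnemonic, target register, source/base register, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm >= 0x8000:
        imm -= 0x10000  # interpret the 16-bit field as a signed offset
    return I_TYPE_OPCODES[opcode], REGISTER_NAMES[rt], REGISTER_NAMES[rs], imm
```

Decoding 0x8f828010 gives a load of v0 at offset -32752 from gp, whereas 0x8c24001c is a load of
a0 at offset 28 from at, which is the second half of a two-instruction lui/lw sequence.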

Both of the references to the variable x are encoded as a gp-relative reference, whereas both
references to the variable y are not. The only difference between x and y is that y is a forward
reference. This behavior is reminiscent of an early 1 or 1.5 pass assembler and should not be
present in a modern 2 pass assembler.


10. Conclusions

The RISC Evaluation Project set out to answer two questions:

1.  Taking  hardware  and  system  software  together,  is  a  machine  built  using  RISC  prin¬ 
ciples  a  feasible  competitor  to  a  CISC  machine? 

2. How well do the actual hardware and software of a specific RISC system (in this case,
the Mips M/500) compare with those of a specific CISC system (in this case, the VAX)?

The  first  question  can  be  answered  only  qualitatively,  in  terms  of  one’s  instinct  or  opinion.  The 
second  question  can  be  addressed  quantitatively  by  analyses  of  benchmarks,  instruction-set  usage 
patterns,  and  other  data. 

This report documents in detail our answer to the second question, presenting both the data
themselves and, where appropriate, the evaluation methods we employed. Our conclusions, culled from
the body of the report, are:*

•  The particular machine studied, the Mips M/500, conforms closely to the CORE ISA
definition and can fairly be classed as a RISC class machine.

•  The overall performance of the hardware is very impressive, about 8 million machine
instructions per second.

•  When code in high-level languages is run, this hardware performance yields objective
benchmark and application performance of about five times that of a VAX-11/780
running Unix 4.3 BSD, this being the unofficial "one MIP" machine.

•  This level of performance was consistent across a wide variety of benchmarks and
applications. Although we stress that benchmark statistics without the accompanying
evaluation are next to useless, we also observed that the Mips M/500 benchmarked at
2408 Whetstones (Fortran single precision), and at 14184 Dhrystones (register and
non-register).

In  attempting  to  answer  the  first  question,  we  made  the  following  observations,  which  we  again 
emphasize  are  qualitative  rather  than  quantitative: 

•  Hand coding of small benchmarks can still provide major improvement over
compiler-generated code. Nevertheless, compilers for the RISC machine performed, overall,
much better than those for the CISC machine.

•  The  compiler-generated  code  shows  substantially  more  effective  usage  of  the  RISC 
instructions  and  addressing  modes,  with  no  serious  inefficiencies  caused  by  omitted 
instructions  and  addressing  modes.  This  finding,  especially,  bears  out  the  claims  made 
on  behalf  of  RISC  machines. 

•  Targeting  a  compiler  to  a  RISC  machine  does  not  seem  much  harder  than  targeting  one 
to  a  CISC  machine.  Different  tasks  have  to  be  done,  but  the  overall  amount  of  work  is 
about  the  same.  However,  we  believe  that  the  compiler  should  also  perform  any  object- 
code  reorganization  that  may  be  required,  rather  than  leaving  this  to  a  separate  pro¬ 
gram. 


•  Fewer actual instructions are required by a CISC machine to perform the same function
as a RISC machine, an expected phenomenon. The ratio of the number of bytes
required to represent these instructions (a much more valid measure) is far closer to one
than is the ratio of instruction counts. With memory costs decreasing as they are, the
greater processing power of the RISC architecture far outweighs the slightly increased
memory use.


*More detailed conclusions can be found in the sections entitled Local Conclusions. These are sections 3.5 (assembly
language reorganization), 4.1.6, 4.2.3, 4.3.3, and 4.4.2 (benchmarking), 6.2.8 and 6.3.4 (compiler utilization of the instruction
set), 8.5 (Pascal compiler conformance), and Appendix Section C.7 (conformance to the CORE ISA). The reader is urged to
read these sections for more information.

We  also  formed  some  conclusions  about  the  assessment  process  itself,  which  are  perhaps  of  gen¬ 
eral  applicability: 

•  It is not easy to disentangle the effects of hardware, operating system, file system,
compilers, and languages. The investigator must be prepared to recognize tiny
anomalies, track down vague clues, run down blind alleys, and perform a large number
of experiments differing only in minute detail.

•  One  must  be  very  specific  about  what  one  is  measuring.  The  same  benchmark  in  two 
languages  may  yield  quite  different  numbers;  the  same  program  run  twice  may  give 
different  timings;  two  compilers  for  the  same  language  may  show  radically  different  code 
patterns  for  the  same  idioms. 

•  The purpose of computing is insight, not numbers. No datum is useful unless it can be
explained; no explanation is useful unless it serves to illuminate an issue or progress an
argument. If there is a "bottom line" in benchmarking, it is that you must understand
what you are doing and why you are doing it.

Finally, it seems appropriate to reiterate the main conclusion of this investigation:

    There may not always be a right choice and a wrong choice in the
    RISC versus CISC debate. However, in all the areas we examined,
    the RISC design was never the wrong architectural choice.


Appendix A: Overview of MIPS Instruction Set Translation

Table A-1 lists the correspondences between the Mips high-level instruction set names for the
registers and the Mips M/500 machine instruction equivalents. Both names can be accessed by the
user (see Chapters 1 and 7 of "The Mips Assembly Language Programmer's Guide" [MIPS 86a] for
more details).


Register Name(s)    Equivalent Name(s)    Register Name(s)    Equivalent Name(s)

$0                  zero                  $16                 s0
$at                 at                    $17                 s1
$2                  v0                    $18                 s2
$3                  v1                    $19                 s3
$4                  a0                    $20                 s4
$5                  a1                    $21                 s5
$6                  a2                    $22                 s6
$7                  a3                    $23                 s7
$8                  t0                    $24                 t8
$9                  t1                    $25                 t9
$10                 t2                    $26 or $kt0         k0
$11                 t3                    $27 or $kt1         k1
$12                 t4                    $28 or $gp          gp
$13                 t5                    $29 or $sp          sp
$14                 t6                    $30 or $fp          fp or s8
$15                 t7                    $31                 ra

Table A-1: Mips M/500 High- and Low-Level Equivalent Register Names


What follows is a table of all of the Mips assembly language instructions followed by the
corresponding Mips M/500 native instructions that are generated by the assembler/reorganizer. We have
attempted to cover all of the possible operand combinations allowed by each instruction. These
modes are typically two operand (dest/src1, src2), three operand (dest, src1, src2), three operand
with one immediate value (including a small integer, a large integer, and a large integer power of
two), and three operand with one zero value (expressed as both an immediate value and as the zero
register).

In all cases, the machine language output has been assembled relative to a base address of 0, so
that all branches are based at the beginning of the code fragment. Each instruction takes up four
bytes, so a branch to address 0x1c will transfer to the eighth instruction (instruction number 7,
counting from zero). Also, the large constant value 2097152 is 0x200000 (a convenient large power of
two that exceeds the immediate operand size of the Mips M/500). The constant values greater than
2097152 are used as values that are not powers of two, for comparison purposes.

The main table is designed to parallel the instruction order listed in Chapter 5 of the Mips Assembly
Language Programmer's Guide [MIPS 86a]. An alphabetic cross reference can be found in table A-2
at the end of this appendix section.
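
The arithmetic behind these conventions can be sketched in a few lines of Python. The function
names split_address and instruction_index are hypothetical. The sign adjustment applied when the
low half is 0x8000 or above reflects the fact that addiu sign-extends its 16-bit immediate; none of
the constants used in this appendix exercises that case, so it is included here as an assumption
about general practice and not as a description of observed assembler output.

```python
def split_address(value):
    """Split a 32-bit value into (hi, lo) such that lui hi followed by
    addiu lo reproduces the value."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFFF
    if lo >= 0x8000:
        lo -= 0x10000  # addiu sign-extends, so the low half becomes negative
    hi = ((value - lo) >> 16) & 0xFFFF  # compensates for a negative low half
    return hi, lo


def instruction_index(byte_address):
    """Zero-based index of the instruction at a byte offset (4 bytes each)."""
    return byte_address // 4
```

For the constant 2097156 used throughout the table, split_address returns a high half of 0x20 and
a low half of 4, matching the lui/addiu pairs shown below.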


Assembler Input         Machine Language Output   Comments

la    $4,($5)           addiu a0,a1,0             All of the addressing modes are available
                                                  to all of the load instructions, but some
                                                  do not make sense. In this case, loading
                                                  the address of an indexed register is the
                                                  same as taking the value of the register.

la    $4,24             li    a0,24

la    $4,2097156        lui   at,0x20             Since the Mips M/500 can only store a
                        addiu a0,at,4             16-bit address in a 32-bit instruction,
                                                  the upper 16 bits of an address must be
                                                  loaded in a separate instruction (the
                                                  lui).

la    $4,24($5)         addiu a0,a1,24            In this case, the address of a based
                                                  address is the value in the base register
                                                  plus the value of the offset.

la    $4,2097156($5)    lui   at,0x20             In this case, too, the address of a based
                        addu  at,at,a1            address is the value in the base register
                        addiu a0,at,4             plus the value of the offset. However, the
                                                  addition must be done in two stages
                                                  because of the limitation of the 16-bit
                                                  immediate field.

la    $4,BEGIN          lui   at,0                Loading the address of a global variable
                        addiu a0,at,0             (that is relocatable) requires that the
                                                  upper 16 bits always be loaded, with the
                                                  linker filling in the correct value (since
                                                  it cannot be determined at assembly time
                                                  what the value of the upper 16 bits will
                                                  be).

la    $4,BEGIN+24       lui   at,0
                        addiu a0,at,24

la    $4,BEGIN($5)      lui   at,0                What appears to be a superfluous addiu in
                        addu  at,at,a1            this sequence is actually needed. The
                        addiu a0,at,0             sequence of events here is to load the
                                                  upper 16 bits of the address of BEGIN,
                                                  then add in (i.e., index off of) register
                                                  a1, then add in the lower 16 bits of the
                                                  address of BEGIN (which will be relocated
                                                  to some other address at link time).

la    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        addiu a0,at,24

lb    $4,($5)           lb    a0,0(a1)

lb    $4,24             lb    a0,24(zero)         An absolute address is expressed as a
                                                  based address off of the zero register.

lb    $4,2097156        lui   at,0x20             When the absolute address exceeds 16
                        lb    a0,4(at)            bits, it is calculated in two stages,
                                                  using at as a temporary register.

lb    $4,24($5)         lb    a0,24(a1)

lb    $4,2097156($5)

lb    $4,BEGIN          lui   at,0                Relocatable addresses are unknown at
                        lb    a0,0(at)            assembly time, so their full 32 bits
                                                  must be planned for by the assembler.

lb    $4,BEGIN+24       lui   at,0
                        lb    a0,24(at)

lb    $4,BEGIN($5)

lb    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lb    a0,24(at)
                        nop

lbu   $4,($5)           lbu   a0,0(a1)            The lbu instruction follows a format
                                                  identical to the lb instruction.

lbu   $4,24             lbu   a0,24(zero)

lbu   $4,2097156

lbu   $4,24($5)         lbu   a0,24(a1)

lbu   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lbu   a0,4(at)

lbu   $4,BEGIN          lui   at,0
                        lbu   a0,0(at)

lbu   $4,BEGIN+24       lui   at,0
                        lbu   a0,24(at)

lbu   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lbu   a0,0(at)

lbu   $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lbu   a0,24(at)


lh    $4,($5)           lh    a0,0(a1)            The lh instruction follows a format
                        nop                       identical to the lb instruction.

lh    $4,24             lh    a0,24(zero)

lh    $4,2097156        lui   at,0x20
                        lh    a0,4(at)
                        nop

lh    $4,24($5)         lh    a0,24(a1)

lh    $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lh    a0,4(at)

lh    $4,BEGIN          lui   at,0
                        lh    a0,0(at)

lh    $4,BEGIN+24       lui   at,0
                        lh    a0,24(at)

lh    $4,BEGIN($5)

lh    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lh    a0,24(at)
                        nop

lhu   $4,($5)           lhu   a0,0(a1)            The lhu instruction follows a format
                        nop                       identical to the lb instruction.

lhu   $4,24             lhu   a0,24(zero)

lhu   $4,2097156        lui   at,0x20
                        lhu   a0,4(at)
                        nop

lhu   $4,24($5)         lhu   a0,24(a1)

lhu   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lhu   a0,4(at)

lhu   $4,BEGIN          lui   at,0
                        lhu   a0,0(at)

lhu   $4,BEGIN+24       lui   at,0
                        lhu   a0,24(at)

lhu   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lhu   a0,0(at)

lhu   $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lhu   a0,24(at)
                        nop

lw    $4,($5)           lw    a0,0(a1)            The lw instruction follows a format
                        nop                       identical to the lb instruction.

lw    $4,24             lw    a0,24(zero)

lw    $4,2097156        lui   at,0x20
                        lw    a0,4(at)
                        nop

lw    $4,24($5)         lw    a0,24(a1)

lw    $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lw    a0,4(at)

lw    $4,BEGIN          lui   at,0
                        lw    a0,0(at)

lw    $4,BEGIN+24       lui   at,0
                        lw    a0,24(at)

lw    $4,BEGIN($5)

lw    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lw    a0,24(at)
                        nop

lwl   $4,($5)           lwl   a0,a1,0             The lwl instruction follows a format
                        nop                       identical to the lb instruction.

lwl   $4,24             lwl   a0,zero,24

lwl   $4,2097156        lui   at,0x20
                        lwl   a0,at,4
                        nop

lwl   $4,24($5)         lwl   a0,a1,24

lwl   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lwl   a0,at,4

lwl   $4,BEGIN          lui   at,0
                        lwl   a0,at,0

lwl   $4,BEGIN+24       lui   at,0
                        lwl   a0,at,24

lwl   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lwl   a0,at,0

lwl   $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lwl   a0,at,24
                        nop

lwr   $4,($5)           lwr   a0,a1,0             The lwr instruction follows the same
                                                  format.

lwr   $4,24


lwr   $4,2097156        lui   at,0x20
                        lwr   a0,at,4
                        nop

lwr   $4,24($5)         lwr   a0,a1,24

lwr   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lwr   a0,at,4

lwr   $4,BEGIN          lui   at,0
                        lwr   a0,at,0

lwr   $4,BEGIN+24       lui   at,0
                        lwr   a0,at,24

lwr   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lwr   a0,at,0

lwr   $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lwr   a0,at,24
                        nop

ld    $4,($5)           lw    a0,0(a1)            The ld instruction does not exist on the
                        lw    a1,4(a1)            Mips M/500 and is implemented with two
                                                  lw instructions.

ld    $4,24                                       The assembler generates no code for this
                                                  instruction, and issues no warning
                                                  message. We cannot find any reason why
                                                  this should be the case.

ld    $4,2097160        lui   at,0x20             The implementation of this instruction is
                        lw    a1,12(at)           clever. Since the full 32 bits of the
                        lw    a0,8(at)            absolute address need to be loaded, the
                        nop                       assembler/reorganizer loads the
                                                  high-order 16 bits with the lui
                                                  instruction, and then accounts for the
                                                  low-order 16 bits in the offsets
                                                  presented to the lw instructions.

ld    $4,24($5)         lw    a0,24(a1)
                        lw    a1,28(a1)

ld    $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        lw    a1,8(at)
                        lw    a0,4(at)
                        nop

ld    $4,BEGIN          lui   at,0
                        lw    a1,4(at)
                        lw    a0,0(at)
                        nop

ld    $4,BEGIN+24       lui   at,0
                        lw    a1,28(at)
                        lw    a0,24(at)
                        nop

ld    $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lw    a1,4(at)
                        lw    a0,0(at)
                        nop

ld    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lw    a1,28(at)
                        lw    a0,24(at)
                        nop

ulh   $4,($5)           lb    a0,0(a1)            The ulh instruction loads a halfword
                        lbu   at,1(a1)            irrespective of the alignment of the
                        sll   a0,a0,8             source address. It must therefore load
                        or    a0,a0,at            the bytes of the halfword independently
                                                  and shift-and-or the results to the
                                                  destination. Thus a simple Mips
                                                  instruction is expanded to 400% of its
                                                  original size.

ulh   $4,24             lb    a0,24(zero)         This is suboptimal code, since the
                        lbu   at,25(zero)         assembler can determine that the absolute
                        sll   a0,a0,8             expression 24 is halfword-aligned. This
                        or    a0,a0,at            should simply emit an lh instruction.

ulh   $4,2097156        lui   at,0x20             Suboptimal code (see above)
                        addiu at,at,4
                        lb    a0,0(at)
                        lbu   at,1(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,24($5)         lb    a0,24(a1)
                        lbu   at,25(a1)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        addiu at,at,4
                        lb    a0,0(at)
                        lbu   at,1(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,BEGIN          lui   at,0
                        lb    a0,0(at)
                        lbu   at,1(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,BEGIN+24       lui   at,0
                        lb    a0,24(at)
                        lbu   at,25(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lb    a0,0(at)
                        lbu   at,1(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulh   $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lb    a0,24(at)
                        lbu   at,25(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulhu  $4,($5)           lbu   a0,0(a1)            The ulhu instruction follows a format
                        lbu   at,1(a1)            identical to the ulh instruction, except
                        sll   a0,a0,8             that it uses lbu instructions instead of
                        or    a0,a0,at            lb instructions.

ulhu  $4,24

ulhu  $4,2097156

ulhu  $4,24($5)

ulhu  $4,2097156($5)

ulhu  $4,BEGIN

ulhu  $4,BEGIN+24

ulhu  $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lbu   a0,0(at)
                        lbu   at,1(at)
                        sll   a0,a0,8
                        or    a0,a0,at

ulhu  $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        lbu   a0,24(at)
                        lbu   at,25(at)
                        sll   a0,a0,8
                        or    a0,a0,at
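
The four-instruction expansion used for ulh and ulhu can be followed step by step in the sketch
below, which models memory as a Python bytes object. The function names mirror the assembler
mnemonics but are otherwise hypothetical, and results are returned as ordinary signed Python
integers instead of 32-bit register contents. The byte at the lower address is treated as the
high-order byte because that is the order in which the expansion shifts and combines the two
bytes.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def ulh(memory, address):
    """Unaligned signed halfword load: lb / lbu / sll / or."""
    a0 = sign_extend(memory[address], 8)  # lb   a0,0(base)
    at = memory[address + 1]              # lbu  at,1(base)
    a0 = a0 << 8                          # sll  a0,a0,8
    return a0 | at                        # or   a0,a0,at


def ulhu(memory, address):
    """Unaligned unsigned halfword load: lbu / lbu / sll / or."""
    a0 = memory[address]                  # lbu  a0,0(base)
    at = memory[address + 1]              # lbu  at,1(base)
    a0 = a0 << 8                          # sll  a0,a0,8
    return a0 | at                        # or   a0,a0,at
```

The only difference between the two is whether the first byte is sign-extended, which is why
the ulhu expansion differs from ulh solely in using lbu for the first load.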

ulw   $4,($5)           lwl   a0,a1,0             Although the expansion for this
                        lwr   a0,a1,3             instruction appears wrong, it is correct.
                        nop                       The ulw instruction is supposed to load a
                                                  word from memory irrespective of its byte
                                                  alignment. If the source address is
                                                  word-aligned, then the lwl and lwr
                                                  instructions will load the same memory
                                                  address twice. If, however, the source
                                                  address is not word-aligned, the two
                                                  instructions will each load a part of the
                                                  source word.

ulw   $4,24             lwl   a0,zero,24          This is suboptimal code, since the
                        lwr   a0,zero,27          assembler can determine that the absolute
                                                  expression 24 is word-aligned. This
                                                  should simply emit an lw instruction.

ulw   $4,2097156        lui   at,0x20             Suboptimal code (see above)
                        addiu at,at,4
                        lwl   a0,at,0
                        lwr   a0,at,3
                        nop

ulw   $4,24($5)         lwl   a0,a1,24
                        lwr   a0,a1,27

ulw   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        addiu at,at,4
                        lwl   a0,at,0
                        lwr   a0,at,3

ulw   $4,BEGIN          lui   at,0
                        lwl   a0,at,0
                        lwr   a0,at,3

ulw   $4,BEGIN+24       lui   at,0
                        lwl   a0,at,24
                        lwr   a0,at,27

ulw   $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        lwl   a0,at,0
                        lwr   a0,at,3


ulw   $4,BEGIN+24($5)

li    $4,24             li    a0,24               The li instruction simply loads an
                                                  immediate value.

li    $4,2097156        lui   a0,0x20             If the source of the li instruction is
                        ori   a0,a0,0x4           larger than 16 bits, the assembler breaks
                                                  it up into two instructions.

                        lui   a0,0x18             The li instruction simply loads an
                                                  immediate value.

sb    $4,($5)           sb    a0,0(a1)            The sb instruction follows a format
                                                  identical to the lb instruction.

sb    $4,24             sb    a0,24(zero)

sb    $4,2097156        lui   at,0x20
                        sb    a0,4(at)

sb    $4,24($5)         sb    a0,24(a1)

sb    $4,2097156($5)

sb    $4,BEGIN          lui   at,0
                        sb    a0,0(at)

sb    $4,BEGIN+24       lui   at,0
                        sb    a0,24(at)

sb    $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        sb    a0,0(at)

sb    $4,BEGIN+24($5)

sd    $4,($5)           sw    a0,0(a1)            The sd instruction does not exist on the
                        sw    a1,4(a1)            Mips M/500 and is implemented with two
                                                  sw instructions. The sd instruction
                                                  follows a format identical to the ld
                                                  instruction.

sd    $4,24                                       The assembler generates no code for this
                                                  instruction and issues no warning
                                                  message. We cannot find any reason why
                                                  this should be the case.

sd    $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        sw    a0,4(at)
                        sw    a1,8(at)

sd    $4,BEGIN          lui   at,0
                        sw    a0,0(at)
                        sw    a1,4(at)

sd    $4,BEGIN+24       lui   at,0
                        sw    a0,24(at)
                        sw    a1,28(at)

sd    $4,BEGIN($5)      lui   at,0
                        addu  at,at,a1
                        sw    a0,0(at)
                        sw    a1,4(at)

sd    $4,BEGIN+24($5)   lui   at,0
                        addu  at,at,a1
                        sw    a0,24(at)
                        sw    a1,28(at)

sh    $4,($5)           sh    a0,0(a1)            The sh instruction follows a format
                                                  identical to the lh instruction.

swl   $4,($5)           swl   a0,a1,0             The swl instruction follows a format
                                                  identical to the lwl instruction.

swl   $4,24             swl   a0,zero,24

swl   $4,2097156        lui   at,0x20
                        swl   a0,at,4

swl   $4,24($5)         swl   a0,a1,24

swl   $4,2097156($5)    lui   at,0x20
                        addu  at,at,a1
                        swl   a0,at,4

swl  $4, BEGIN 

lui  at , 0 

swl  a0,at,0 

swl  $4,BEGIN+24 

lui  at , 0 

swl  a0,at,24 

swl  $4,BEGIN($5) 

lui  at,0 

addu  at , at , al 

swl  aO , at , 0 

swl  $4, EEGIN+24  ($5) 

lui  at , 0 

addu  at , at , al 

swl  aO,at, 24 

swr  $4,  ($5) 

swr  a0,al,0 

The  swr  instruction  follows  a  format 
identical  to  the  Iwl  instruction. 

swr $4,24
swr a0,zero,24

swr $4,2097156
lui at,0x20
swr a0,at,4

swr $4,24($5)
swr a0,a1,24

swr $4,2097156($5)
lui at,0x20
addu at,at,a1
swr a0,at,4

swr $4,BEGIN
lui at,0
swr a0,at,0

swr $4,BEGIN+24
lui at,0
swr a0,at,24

swr $4,BEGIN($5)
lui at,0
addu at,at,a1
swr a0,at,0

swr $4,BEGIN+24($5)
lui at,0
addu at,at,a1
swr a0,at,24

sw $4,($5)
sw a0,0(a1)

The sw instruction follows a format identical to the lw instruction.

sw $4,24
sw a0,24(zero)

sw $4,2097156
lui at,0x20
sw a0,4(at)

sw $4,24($5)
sw a0,24(a1)

sw $4,2097156($5)
lui at,0x20
addu at,at,a1
sw a0,4(at)

sw $4,BEGIN
lui at,0
sw a0,0(at)

Assembler Input    Machine Language Output    Comments

sw $4,BEGIN+24
lui at,0
sw a0,24(at)

sw $4,BEGIN($5)
lui at,0
addu at,at,a1
sw a0,0(at)

sw $4,BEGIN+24($5)
lui at,0
addu at,at,a1
sw a0,24(at)

ush $4,($5)
sb a0,1(a1)
srl at,a0,8
sb at,0(a1)

The ush instruction follows a format identical to the ulh instruction,
except that the ush instruction uses the store-shift-store method.

ush $4,24
sb a0,25(zero)
srl at,a0,8
sb at,24(zero)

This is suboptimal code, since the assembler can determine that the
absolute expression 24 is halfword-aligned. This should simply emit an
sh instruction.

ush $4,2097156
lui at,0x20
addiu at,at,4
sb a0,1(at)
srl a0,a0,8
sb at,0(at)
lbu at,1(at)
sll a0,a0,8
or a0,a0,at

Suboptimal code (see above and below).

ush $4,24($5)
sb a0,25(a1)
srl at,a0,8
sb at,24(a1)

ush $4,2097156($5)
lui at,0x20
addu at,at,a1
addiu at,at,4
sb a0,1(at)
srl a0,a0,8
sb at,0(at)
lbu at,1(at)
sll a0,a0,8
or a0,a0,at

This code is a classic example of a reason not to dedicate a single
temporary register to an assembler/reorganizer, and an argument for
putting reorganization into the compiler. This instruction uses at as a
temporary register in the calculation of the destination address.
However, since the single temporary register is in use for that purpose,
it must destructively shift a0 to the right to perform both sb
instructions. It must then re-shift a0 to the left, and re-load the
previously stored value to reconstruct the original value in a0. If this
value is never used again, three instructions are wasted (and as it is,
a single Mips instruction gets expanded to nine times its original
size). For a discussion of this and other deleterious effects of the
reorganizer, see Chapter 7.
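
The save-and-restore cost described in the comment on ush can be traced with a small simulation. The following Python sketch is illustrative only: the function name, the dictionary standing in for memory, and the big-endian byte placement are assumptions made for the example, and the second store is modeled as writing the shifted a0, which is what the comment describes.

```python
MASK32 = 0xFFFFFFFF

def ush_expanded(a0, addr, mem):
    """Model the expansion of `ush $4,2097156($5)` (hypothetical helper).

    `addr` stands for the effective address left in `at` by the
    lui/addu/addiu prologue. Returns the final value of a0.
    """
    at = addr                    # lui / addu / addiu: at = target address
    mem[at + 1] = a0 & 0xFF      # sb  a0,1(at)  store the low byte
    a0 = (a0 >> 8) & MASK32      # srl a0,a0,8   a0 is overwritten here
    mem[at] = a0 & 0xFF          # sb  ..,0(at)  store the next byte
    at = mem[at + 1]             # lbu at,1(at)  reload the byte just stored
    a0 = (a0 << 8) & MASK32      # sll a0,a0,8   undo the shift
    a0 = a0 | at                 # or  a0,a0,at  a0 is whole again
    return a0

memory = {}
restored = ush_expanded(0x12345678, 0x200004, memory)
print(hex(memory[0x200004]), hex(memory[0x200005]), hex(restored))
```

The last three modeled steps exist only to rebuild a0. With a second scratch register the shift could target that register instead, as it does in the shorter ush rows where at is free.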

ush $4,BEGIN
lui at,0
sb a0,1(at)
srl a0,a0,8
sb at,0(at)
lbu at,1(at)
sll a0,a0,8
or a0,a0,at

Suboptimal, see above.

ush $4,BEGIN+24
lui at,0
sb a0,25(at)
srl a0,a0,8
sb at,24(at)
lbu at,25(at)
sll a0,a0,8
or a0,a0,at

Suboptimal, see above.

ush $4,BEGIN($5)
lui at,0
addu at,at,a1
sb a0,1(at)
srl a0,a0,8
sb at,0(at)
lbu at,1(at)
sll a0,a0,8
or a0,a0,at

Suboptimal, see above.

ush $4,BEGIN+24($5)
lui at,0
addu at,at,a1
sb a0,25(at)
srl a0,a0,8
sb at,24(at)
lbu at,25(at)
sll a0,a0,8
or a0,a0,at

Suboptimal, see above.

usw $4,($5)
swl a0,a1,0
swr a0,a1,3

The usw instruction follows a format identical to the ulw instruction.

usw $4,24
swl a0,zero,24
swr a0,zero,27

usw $4,2097156
lui at,0x20
addiu at,at,4
swl a0,at,0
swr a0,at,3

usw $4,24($5)
swl a0,a1,24
swr a0,a1,27

usw $4,2097156($5)
lui at,0x20
addu at,at,a1
addiu at,at,4
swl a0,at,0
swr a0,at,3

usw $4,BEGIN
lui at,0
swl a0,at,0
swr a0,at,3

usw $4,BEGIN+24
lui at,0
swl a0,at,24
swr a0,at,27

usw $4,BEGIN($5)
lui at,0
addu at,at,a1
swl a0,at,0
swr a0,at,3

usw $4,BEGIN+24($5)

abs $4
bgez a0,0xc
nop
sub a0,zero,a0

The Mips high-level assembler has an abs instruction, but there is no
corresponding instruction in the machine language. Instead, the
assembler reorganizer translates the abs instruction into a test,
branch, and negate triplet. This causes a 3:1 increase in execution
time for this instruction. Statistically, this increase is of small
significance, since the abs instruction is rarely used in compiled code.

abs $4,$5
bgez a1,0xc
move a0,a1
sub a0,zero,a1

As shown in Figure 3-4 on page 8, the move instruction has been shifted
down to fill the nop after the bgez instruction. The move is always
executed, whether or not the branch is taken.

abs $4,$0
bgez zero,0xc
move a0,zero
sub a0,zero,zero

The absolute value of zero is obviously zero, so that while this code
expansion is correct, it would be more reasonable to change it to
move a0,zero.

neg $4
sub a0,zero,a0

The Mips M/500 does not have a negate instruction but performs this
operation by subtracting the number from zero. Depending on whether a
signed or unsigned negate is desired, a sub or subu instruction is used.
Since the cycle count for this operation is still 1, there is no
sacrifice in execution speed.

neg $4,$5
sub a0,zero,a1

neg $4,$0
sub a0,zero,zero

The negative of 0 is still 0. This instruction could be replaced with
move a0,zero, although its current form is no more expensive to execute.

negu $4
subu a0,zero,a0

negu $4,$5
subu a0,zero,a1
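
The test, branch, and negate triplet that replaces abs, together with the delay-slot behavior noted for the move, can be sketched as follows. This is a minimal Python model under stated assumptions, not the assembler's implementation: the function name is hypothetical, and the trap raised by sub on signed overflow is represented by a Python exception.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def abs_expanded(a1):
    """Model `abs $4,$5` -> bgez a1,L ; move a0,a1 ; sub a0,zero,a1 ; L:"""
    branch_taken = a1 >= 0       # bgez a1,L   branch decision
    a0 = a1                      # move a0,a1  delay slot: runs either way
    if not branch_taken:
        result = 0 - a1          # sub a0,zero,a1
        if result > INT_MAX:     # sub traps on signed overflow
            raise OverflowError("sub overflow")
        a0 = result
    return a0

print(abs_expanded(-5), abs_expanded(5), abs_expanded(0))
```

Because the move sits in the delay slot, a non-negative operand costs two instructions and a negative one costs three.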

negu $4,$0
subu a0,zero,zero

The negative of 0 is still 0. This instruction could be replaced with
move a0,zero, although its current form is no more expensive to execute.

not $4
nor a0,a0,zero

The Mips M/500 does not have a complement instruction but performs this
operation by executing a nor with 0. Since the cycle count for this
instruction is still 1, there is no sacrifice in execution speed.

not $4,$5
nor a0,a1,zero

not $0
nor zero,zero,zero

The zero register as a destination is meaningless. This instruction
should be elided or replaced with a nop instruction.

not $4,$0
nor a0,zero,zero

This expansion makes sense, especially when considered as the fastest
way to load a register full of ones.
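
The mapping from not onto nor can be checked with a few lines of Python. The sketch below is illustrative only; the helper names are assumptions, and registers are modeled as 32-bit unsigned values.

```python
MASK32 = 0xFFFFFFFF

def nor(rs, rt):
    """32-bit NOR, the native operation behind the `not` pseudo-instruction."""
    return ~(rs | rt) & MASK32

def not_expanded(rs):
    """Model `not rd,rs` -> nor rd,rs,zero."""
    return nor(rs, 0)

print(hex(not_expanded(0xFFFF0000)), hex(nor(0, 0)))
```

The second value printed corresponds to `not $4,$0`: a nor of the zero register with itself yields a word of all ones in a single instruction.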

add $4,$5
add a0,a0,a1

add $4,$5,$6
add a0,a1,a2

add $4,$5,$0
add a0,a1,zero

add $4,$5,0
addi a0,a1,0

add $4,$0
add a0,a0,zero

This instruction sequence does nothing, and should be elided by the
assembler reorganizer.

add $4,0

This instruction sequence does nothing, and should be elided by the
assembler reorganizer.

add $4,$0,$5
add a0,zero,a1

This instruction could be replaced by a move a0,a1. However, performing
the add incurs no extra expense.

add $4,$5,15
addi a0,a1,15

add $4,$5,2097153
lui at,0x20
ori at,at,0x1
add a0,a1,at

The Mips M/500 native instruction set limits the size of immediate
operands to 16 bits. Therefore, when a large constant value is needed,
it is loaded in 16-bit halves. The lui instruction loads the upper half
of the register (clearing the lower half), while the ori instruction
ORs in the lower half.
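
The split of a large constant into 16-bit halves can be sketched as follows. This is a hedged illustration in Python, not the assembler's code: the function names are hypothetical, and the omission of the ori when the lower half is zero follows the 2097152 row of the table.

```python
MASK32 = 0xFFFFFFFF

def split_constant(value):
    """Return the (upper, lower) 16-bit halves of a 32-bit constant."""
    value &= MASK32
    return value >> 16, value & 0xFFFF

def expand_large_constant(value):
    """Instruction text that builds `value` in the temporary register at."""
    upper, lower = split_constant(value)
    code = [f"lui at,{hex(upper)}"]
    if lower != 0:                       # ori is skipped when the low half is zero
        code.append(f"ori at,at,{hex(lower)}")
    return code

def simulate(value):
    """Compute what lui followed by ori leaves in at."""
    upper, lower = split_constant(value)
    at = (upper << 16) & MASK32          # lui: upper half set, lower half cleared
    return at | lower                    # ori: OR in the lower half

print(expand_large_constant(2097153))
print(expand_large_constant(2097152))
```

For 2097153 (0x200001) the halves are 0x20 and 0x1, matching the lui and ori operands shown in the table.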

add $4,$5,2097152
lui at,0x20
add a0,a1,at

When an immediate operand is longer than 16 bits, but the bottom 16 bits
are zeroes, the assembler reorganizer never generates the ori
instruction.

addu $4,$5
addu a0,a0,a1

addu $4,$5,$6
addu a0,a1,a2

addu $4,$5,$0
move a0,a1

There is no actual move instruction on the Mips M/500. Instead, the
assembler allows it as a pseudo-instruction, and encodes it as an addu
with zero. The disassembler also knows of this mapping, which accounts
for the translation shown here.

addu $4,$5,0
addiu a0,a1,0

addu $4,$0
move a0,a0

This instruction sequence clearly does nothing and should be elided by
the assembler reorganizer.

addu $4,0
addiu a0,a0,0

This instruction sequence does nothing and should be elided by the
assembler reorganizer.

addu $4,$0,$5
addu a0,zero,a1

This instruction could be replaced by a move a0,a1. However, performing
the add incurs no extra expense.

addu $4,$5,15
addiu a0,a1,15

addu $4,$5,2097153
lui at,0x20
ori at,at,0x1
addu a0,a1,at

and $4,$5
and a0,a0,a1

and $4,$5,$6
and a0,a1,a2

and $4,$5,$0
and a0,a1,zero

and $4,$5,0
andi a0,a1,0

and $4,$0
and a0,a0,zero

This could be replaced by move a0,zero. Keeping the and instruction,
however, incurs no extra expense.

and $4,0
andi a0,a0,0

This could be replaced by move a0,zero. Keeping the andi instruction,
however, incurs no extra expense. Notice, however, how the assembler
reorganizer again treats the constant value 0 and the zero register
differently.

and $4,$0,$5
and a0,zero,a1

This could also be replaced by move a0,zero. Keeping the and
instruction, however, incurs no extra expense.

and $4,$5,15
andi a0,a1,0xf

and $4,$5,2097153
lui at,0x20
ori at,at,0x1
and a0,a1,at

and $4,$5,2097152
lui at,0x20
and a0,a1,at

div $4,$5
div a0,a1
bne a1,zero,0x10
nop
break 7
li at,-1
bne a1,at,0x28
lui at,0x8000
bne a0,at,0x28
nop
break 6
mflo a0
nop
nop

This expansion is a little complicated in that the overflow checking
advertised in the documentation is done at run-time by the software and
not by the Mips M/500 div instruction. The first test is for division by
zero; execution reaches the break 7 if this is the case. The second test
is for division of the largest negative number by -1 (effectively taking
the absolute value of the largest negative number). Since there is one
more negative number than positive number in two's complement
arithmetic, this would be an overflow condition, so the code tests for
it and reaches the break 6 if this is the case.
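
The two run-time checks wrapped around the hardware divide can be sketched as follows. This is a minimal Python model, assuming 32-bit two's complement operands; the Break exception class and the function name are hypothetical stand-ins for the break 7 and break 6 traps.

```python
INT_MIN = -2**31

class Break(Exception):
    """Stand-in for a Mips break instruction with the given code."""
    def __init__(self, code):
        super().__init__(f"break {code}")
        self.code = code

def div_expanded(dividend, divisor):
    """Model the software checks in the expansion of `div $4,$5`."""
    if divisor == 0:                            # bne a1,zero,... / break 7
        raise Break(7)
    if divisor == -1 and dividend == INT_MIN:   # li at,-1 ... lui at,0x8000 / break 6
        raise Break(6)
    # mflo a0: the quotient, truncated toward zero
    quotient = abs(dividend) // abs(divisor)
    return -quotient if (dividend < 0) != (divisor < 0) else quotient

print(div_expanded(7, 2), div_expanded(-7, 2))
```

Only one dividend and divisor pair can overflow, which is why the second check needs two comparisons before it can reach the break 6.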

div $4,$5,$6
div a1,a2
bne a2,zero,0x10
nop
break 7
li at,-1
bne a2,at,0x28
lui at,0x8000
bne a1,at,0x28
nop
break 6
mflo a0
nop
nop

div $4,$5,$0
div a1,zero
bne zero,zero,0x10
nop
break 7
li at,-1
bne zero,at,0x28
lui at,0x8000
bne a1,at,0x28
nop
break 6
mflo a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message. The error will
still be detected at run-time, so this translation is legal, though
suboptimal.

div $4,$5,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by zero.

div $4,$0,$5
div zero,a1
bne a1,zero,0x64
nop
break 7
li at,-1
bne a1,at,0x7c
lui at,0x8000
bne zero,at,0x7c
nop
break 6
mflo a0
nop
nop

The assembler reorganizer fails to recognize that a dividend of zero
always results in a quotient of zero, unless the divisor is also zero.
The code here could be correspondingly shortened and sped up (through
the elimination of the div instruction).

div $4,$0
div a0,zero
bne zero,zero,0x98
nop
break 7
li at,-1
bne zero,at,0xb0
lui at,0x8000
bne a0,at,0xb0
nop
break 6
mflo a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message. The error will
still be detected at run-time, so this translation is legal, though
suboptimal.

div $4,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

div $4,$5,15
li at,15
div a1,at
mflo a0
nop
nop

div $4,$5,2097152
bgez a1,0x10
move at,a1
lui at,0x20
addiu at,a1,-1
sra a0,at,21

Notice that division by a power of two is accomplished by simply
arithmetically shifting the source register to the right. If the source
register is negative, then it is decremented by 1 prior to shifting to
ensure correct results (without the decrementation, -5 >> 1 yields -3,
although -5 / 2 = -2). Notice also that the effects of the move
instruction are canceled if the branch is not taken (remember that the
move executes before the bgez completes), but that the move instruction
is necessary if the branch is taken. Contrast this behavior with that of
the divu instruction on page 165.

div $4,$5,2097153
lui at,0x20
ori at,at,0x1
div a1,at
mflo a0
nop
nop

divu $4,$5
divu a0,a1
bne a1,zero,0x10
nop
break 7
mflo a0
nop
nop

divu $4,$5,$6
divu a1,a2
bne a2,zero,0x10
nop
break 7
mflo a0
nop
nop

divu $4,$5,$0
divu a1,zero
bne zero,zero,0x10
nop
break 7
mflo a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message. The error will
still be detected at run-time, so this translation is legal, though
suboptimal.

divu $4,$5,0
break 7

The assembler reorganizer here correctly detects a divide by zero, and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

divu $4,$0,$5
divu zero,a1
bne a1,zero,0xd0
nop
break 7
mflo a0
nop
nop

The assembler reorganizer fails to recognize that a dividend of zero
always results in a quotient of zero, unless the divisor is also zero.
The code here could be correspondingly shortened and sped up (through
the elimination of the div instruction).

divu $4,$0
divu a0,zero
bne zero,zero,0xec
nop
break 7
mflo a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message. The error will
still be detected at run-time, so this translation is legal, though
suboptimal.

divu $4,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

divu $4,$5,15
li at,15
divu a1,at
mflo a0
nop
nop

divu $4,$5,2097152
srl a0,a1,21

Notice that division by a power of two is accomplished by shifting the
source to the right. There is no check for negative numbers here as
there was with the div instruction on page 164. This is because the divu
instruction is designed to operate only on unsigned (i.e., non-negative)
numbers.
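
The equivalence between an unsigned divide by a power of two and a logical right shift can be illustrated briefly. The Python sketch below is a hedged example; the function name is hypothetical, and the register is modeled as a 32-bit unsigned value.

```python
def divu_pow2(a1, divisor):
    """Model `divu rd,rs,2**k` -> srl rd,rs,k for an unsigned 32-bit rs."""
    if divisor <= 0 or divisor & (divisor - 1) != 0:
        raise ValueError("divisor must be a power of two")
    k = divisor.bit_length() - 1       # 2097152 -> 21
    return (a1 & 0xFFFFFFFF) >> k      # srl a0,a1,k

print(divu_pow2(5 * 2097152 + 3, 2097152))
```

No sign adjustment is needed because a logical shift of an unsigned value always rounds toward zero, just as unsigned division does.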

divu $4,$5,2097153
lui at,0x20
ori at,at,0x1
divu a1,at
mflo a0
nop
nop

xor $4,$5
xor a0,a0,a1

xor $4,$5,$6
xor a0,a1,a2

xor $4,$5,$0
xor a0,a1,zero

This instruction sequence is equivalent to move a0,a1. However, since
both instructions take a single cycle to execute, there is no penalty at
run-time.

xor $4,$5,0
xori a0,a1,0

This instruction sequence is equivalent to move a0,a1. However, since
both instructions take a single cycle to execute, there is no penalty at
run-time.

xor $4,$0,$5
xor a0,zero,a1

xor $4,$0
xor a0,a0,zero

An exclusive-or with the zero register leaves a0 unchanged, so this
instruction does nothing and could be elided. (Complementing a0 would
require nor a0,a0,zero.)

xor $4,0
xori a0,a0,0

xor $4,$5,15
xori a0,a1,0xf

xor $4,$5,2097153
lui at,0x20
ori at,at,0x1
xor a0,a1,at

mul $4,$5
multu a0,a1
mflo a0
nop
nop

mul $4,$5,$6
multu a1,a2
mflo a0
nop
nop

mul $4,$5,$0
multu a1,zero
mflo a0
nop
nop

While the assembler reorganizer is smart enough to recognize that a
multiply by a constant value zero produces a zero result, it does not
correctly handle the case of multiplication by the zero register, and
instead causes the multiplication to be needlessly executed. This is the
case for all types of multiply instructions.

mul $4,$5,0
move a0,zero

mul $4,$0,$5
multu zero,a1
mflo a0
nop
nop

The assembler reorganizer should code this as move a0,zero, instead of
consuming many cycles performing a multiplication by zero.

mul $4,$0
multu a0,zero
mflo a0
nop
nop

The assembler reorganizer should code this as move a0,zero, instead of
consuming many cycles performing a multiplication by zero.

mul $4,0
move a0,zero

mul $4,$5,15
sll a0,a1,4
subu a0,a0,a1

Multiplication by a constant is converted into a sequence of shifts and
adds (or subtracts). See Section 3.2.1 for more details.
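
The shift-and-subtract expansion of a multiply by 15 rests on the identity 15x = 16x - x. The Python sketch below models the two instructions on 32-bit registers; it is illustrative only, and the function name is an assumption.

```python
MASK32 = 0xFFFFFFFF

def mul15(a1):
    """Model `mul $4,$5,15` -> sll a0,a1,4 ; subu a0,a0,a1."""
    a0 = (a1 << 4) & MASK32      # sll  a0,a1,4   a0 = 16 * a1 (wraps silently)
    a0 = (a0 - a1) & MASK32      # subu a0,a0,a1  a0 = 16 * a1 - a1
    return a0

print(mul15(7))
```

Both sll and subu wrap modulo 2**32 without trapping, so the result agrees with the low 32 bits of the true product for every input.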

mul $4,$5,2097152
sll a0,a1,21

The mul instruction is substantially faster than the mulo instruction
(see page 168), since it does not have to check for overflow (the sll
instruction used here does not register a numeric overflow).

mul $4,$5,2097153
sll a0,a1,21
addu a0,a0,a1

The mul instruction is substantially faster than the mulo instruction
(see page 168), since it does not have to check for overflow (the sll
instruction used here does not register a numeric overflow).

mulo $4,$5
mult a0,a1
mflo a0
sra a0,a0,31
mfhi at
beq a0,at,0x1c
mflo a0
break 6
nop

mulo $4,$5,$6
mult a1,a2
mflo a0
sra a0,a0,31
mfhi at
beq a0,at,0x1c
mflo a0
break 6
nop

mulo $4,$5,$0
mult a1,zero
mflo a0
sra a0,a0,31
mfhi at
beq a0,at,0x1c
mflo a0
break 6
nop

While the assembler reorganizer is smart enough to recognize that a
multiply by a constant value zero produces a zero result, it does not
correctly handle the case of multiplication by the zero register, and
instead causes the multiplication to be needlessly executed. This is the
case for all types of multiply instructions.

mulo $4,$5,0
move a0,zero

mulo $4,$0,$5
mult zero,a1
mflo a0
sra a0,a0,31
mfhi at
beq a0,at,0x1c
mflo a0
break 6
nop

The assembler reorganizer should code this as move a0,zero, instead of
consuming many cycles performing a multiplication by zero.

mulo $4,$5,15
add a0,a1,a1
add a0,a0,a1
add a0,a0,a0
add a0,a0,a1
add a0,a0,a0
add a0,a0,a1

Note that this sequence of instructions allows for the overflow checking
described in the documentation (since the add instruction can signal an
overflow condition). Contrast this with the multiplication by a constant
using the mul instruction on page 167. Also, see Section 3.2.1 for a
more detailed analysis of multiplication instruction expansion.
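
The chain of six adds can be followed step by step to see both how it reaches 15 times the operand and where an overflow would be signaled. The Python sketch below is a minimal model: the helper names are hypothetical, and the overflow trap of the add instruction is represented by a Python exception.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add_trapping(x, y):
    """Signed 32-bit add that raises where the Mips add would trap."""
    total = x + y
    if not INT_MIN <= total <= INT_MAX:
        raise OverflowError("add overflow")
    return total

def mulo15(a1):
    """Model `mulo $4,$5,15` as the six-add chain from the table."""
    a0 = add_trapping(a1, a1)    # add a0,a1,a1   2 * a1
    a0 = add_trapping(a0, a1)    # add a0,a0,a1   3 * a1
    a0 = add_trapping(a0, a0)    # add a0,a0,a0   6 * a1
    a0 = add_trapping(a0, a1)    # add a0,a0,a1   7 * a1
    a0 = add_trapping(a0, a0)    # add a0,a0,a0  14 * a1
    a0 = add_trapping(a0, a1)    # add a0,a0,a1  15 * a1
    return a0

print(mulo15(7), mulo15(-3))
```

Each intermediate value is a smaller multiple of the operand than the final product, so a trap occurs in this model exactly when 15 times the operand does not fit in 32 signed bits.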

mulo $4,$5,2097153
lui at,0x20
ori at,at,0x1
mult a1,at
mflo a0
sra a0,a0,31
mfhi at
beq a0,at,0x24
mflo a0
break 6
nop

mulou $4,$5
multu a0,a1
mfhi at
beq at,zero,0x14
mflo a0
break 6
nop

mulou $4,$5,$6
multu a1,a2
mfhi at
beq at,zero,0x14
mflo a0
break 6
nop

mulou $4,$5,$0
multu a1,zero
mfhi at
beq at,zero,0x14
mflo a0
break 6
nop

While the assembler reorganizer is smart enough to recognize that a
multiply by a constant value zero produces a zero result, it does not
correctly handle the case of multiplication by the zero register, and
instead causes the multiplication to be needlessly executed. This is the
case for all types of multiply instructions.

mulou $4,$5,0
move a0,zero

mulou $4,$0,$5
multu zero,a1
mfhi at
beq at,zero,0x14
mflo a0
break 6
nop

The assembler reorganizer should code this as move a0,zero, instead of
consuming many cycles performing a multiplication by zero.

mulou $4,$0
multu a0,zero
mfhi at
beq at,zero,0x14
mflo a0
break 6
nop

The assembler reorganizer should code this as move a0,zero, instead of
consuming many cycles performing a multiplication by zero.

mulou $4,0
move a0,zero

mulou $4,$5,15
li at,15
multu a1,at
mfhi at
beq at,zero,0x18
mflo a0
break 6
nop

mulou $4,$5,2097153
lui at,0x20
ori at,at,0x1
multu a1,at
mfhi at
beq at,zero,0x1c
mflo a0
break 6
nop

nor $4,$5
nor a0,a0,a1

nor $4,$5,$6
nor a0,a1,a2

nor $4,$5,$0
nor a0,a1,zero

nor $4,$5,0
ori a0,a1,0
nor a0,a0,zero

The assembler reorganizer fails to recognize the special case of a nor
with a constant value 0, and generates one extra instruction here. The
correct behavior would be to simply perform a nor a0,a1,zero.

nor $4,$4
nor a0,a0,a0

nor $4,$0,$5
nor a0,zero,a1

nor $4,$0
nor a0,a0,zero

nor $4,0
ori a0,a0,0
nor a0,a0,zero

The assembler reorganizer fails to recognize the special case of a nor
with a constant value 0, and generates one extra instruction here. The
correct behavior would be to simply perform a nor a0,a0,zero.

nor $4,$5,15
ori a0,a1,0xf
nor a0,a0,zero

The assembler reorganizer breaks the simple nor instruction into two
instructions (an ori and a nor). Since the Mips M/500 native instruction
set has a nor in its repertoire, we can conclude that either the
assembler reorganizer is making a mistake here or that the native
instruction set is not orthogonal, and that the nor instruction cannot
be executed with an immediate operand.
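
The two-instruction sequence computes the same value that a single nor with an immediate operand would, which the following Python sketch checks. It is illustrative only; the function name is hypothetical, and the immediate is assumed to be zero-extended to 32 bits as ori does.

```python
MASK32 = 0xFFFFFFFF

def nor_immediate(a1, imm16):
    """Model `nor $4,$5,imm` -> ori a0,a1,imm ; nor a0,a0,zero."""
    a0 = (a1 | (imm16 & 0xFFFF)) & MASK32   # ori a0,a1,imm   (zero-extended)
    a0 = ~(a0 | 0) & MASK32                 # nor a0,a0,zero  complement the OR
    return a0

print(hex(nor_immediate(0xF0, 15)))
```

The first step forms the OR and the second complements it, so the pair costs one instruction more than a register-to-register nor.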


nor $4,$5,2097153
lui at,0x20
ori at,at,0x1
nor a0,a1,at

or $4,$5
or a0,a0,a1

or $4,$5,$6
or a0,a1,a2

or $4,$5,$0
or a0,a1,zero

An or with zero could easily be translated into move a0,a1, but since
there is no additional overhead in not doing that, the assembler
reorganizer is behaving appropriately. Where the source and destination
registers are identical, the or can be deleted entirely; in this case,
the assembler reorganizer fails to recognize this shortcut.

or $4,$5,0
ori a0,a1,0

or $4,$0,$5
or a0,zero,a1

This could also be translated into move a0,a1, with no greater or lesser
run-time expense.

or $4,$0
or a0,a0,zero

This instruction does nothing and should be elided by the assembler
reorganizer.

or $4,0
ori a0,a0,0

This instruction also does nothing, and should be elided by the
assembler reorganizer.

or $4,$5,15
ori a0,a1,0xf

or $4,$5,2097153
lui at,0x20
ori at,at,0x1
or a0,a1,at

The ori instruction is to load in the lower half of the constant
2097153. The or instruction performs the actual work.

rem $4,$5
div a0,a1
bne a1,zero,0x10
nop
break 7
li at,-1
bne a1,at,0x28
lui at,0x8000
bne a0,at,0x28
nop
break 6
mfhi a0
nop
nop

rem $4,$5,$6
div a1,a2
bne a2,zero,0x10
nop
break 7
li at,-1
bne a2,at,0x28
lui at,0x8000
bne a1,at,0x28
nop
break 6
mfhi a0
nop
nop

rem $4,$5,$0
div a1,zero
bne zero,zero,0x10
nop
break 7
li at,-1
bne zero,at,0x28
lui at,0x8000
bne a1,at,0x28
nop
break 6
mfhi a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message. The error will
still be detected at run-time, so this translation is legal, though
suboptimal.

rem $4,$5,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

rem $4,$0,$5
div zero,a1
bne a1,zero,0x10
nop
break 7
li at,-1
bne a1,at,0x28
lui at,0x8000
bne zero,at,0x28
nop
break 6
mfhi a0
nop
nop

This instruction should be recoded much more simply, since a division
does not need to be performed when the dividend of a remainder operation
is zero.

rem $4,$0
div a0,zero
bne zero,zero,0x10
nop
break 7
li at,-1
bne zero,at,0x28
lui at,0x8000
bne a0,at,0x28
nop
break 6
mfhi a0
nop
nop

This instruction should be recoded much more simply, since a division
does not need to be performed when the dividend of a remainder operation
is zero.

rem $4,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

rem $4,$5,15
li at,15
div a1,at
mfhi a0
nop
nop

rem $4,$5,2097152
lui at,0x20
addiu at,at,-1
bgez a1,0x1c
and a0,a1,at
beq a0,zero,0x1c
addiu at,at,1
subu a0,a0,at

rem $4,$5,2097153
lui at,0x20
ori at,at,0x1
div a1,at
mfhi a0
nop
nop

remu $4,$5
divu a0,a1
bne a1,zero,0x10
nop
break 7
mfhi a0
nop
nop

remu $4,$5,$6
divu a1,a2
bne a2,zero,0x10
nop
break 7
mfhi a0
nop
nop

remu $4,$5,$0
divu a1,zero
bne zero,zero,0x10
nop
break 7
mfhi a0
nop
nop

Even though this instruction is performing a divide by zero (by using
the zero register, which always contains the constant value 0), the
assembler reorganizer does not issue an error message.

remu $4,$5,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

remu $4,$0,$5
divu zero,a1
bne a1,zero,0x10
nop
break 7
mfhi a0
nop
nop

This instruction should be recoded much more simply, since a division
does not need to be performed when the dividend of a remainder operation
is zero.

remu $4,$0
divu a0,zero
bne zero,zero,0x10
nop
break 7
mfhi a0
nop
nop

This instruction should be recoded much more simply, since a division
does not need to be performed when the dividend of a remainder operation
is zero.

remu $4,0
break 7

The assembler reorganizer here correctly detects a divide by zero and
simply generates a break 7 instruction (which traps to an error handler
at run-time), rather than actually generating a sequence of instructions
that will divide by 0.

remu $4,$5,15
li at,15
divu a1,at
mfhi a0
nop
nop

remu $4,$5,2097152
lui at,0x20
addiu at,at,-1
and a0,a1,at

remu $4,$5,2097153
lui at,0x20
ori at,at,0x1
divu a1,at
mfhi a0
nop
nop

rol $4,$5
subu at,zero,a1
srlv at,a0,at
sllv a0,a0,a1
or a0,a0,at

The Mips M/500 native instruction set does not have a rotate
instruction. What the assembler reorganizer does is to shift the source
register both right and left and merge the results into the destination
register. For a rol instruction, the source word is logically (not
arithmetically) shifted right by the negative of the rotation amount.
Since the native instruction set specifies that the shift amount is
taken modulo 32, this translates into a shift right by the correct
number of bits. The same register is then shifted left by the specified
amount, and the results are merged together with an or instruction.
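
The way two shifts and an or combine into a rotation, including the reliance on shift amounts being taken modulo 32, can be sketched as follows. This Python model is illustrative only; the function name is an assumption, and registers are treated as 32-bit unsigned values.

```python
MASK32 = 0xFFFFFFFF

def rol_expanded(a0, a1):
    """Model `rol $4,$5` -> subu at,zero,a1 ; srlv ; sllv ; or."""
    at = (0 - a1) & MASK32              # subu at,zero,a1  negated rotate count
    at = (a0 & MASK32) >> (at & 31)     # srlv at,a0,at    uses low 5 bits: 32 - n
    a0 = (a0 << (a1 & 31)) & MASK32     # sllv a0,a0,a1    shift left by n
    return a0 | at                      # or   a0,a0,at    merge the two parts

print(hex(rol_expanded(0x12345678, 8)))
```

Negating the count and keeping only its low five bits gives 32 - n for any n from 1 to 31, which is exactly the right-shift distance needed to bring the bits shifted out on the left back in on the right.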

rol $4,$5,$6
subu at,zero,a2
srlv at,a1,at
sllv a0,a1,a2
or a0,a0,at

rol $4,$5,$0
subu at,zero,zero
srlv at,a1,at
sllv a0,a1,zero
or a0,a0,at

The assembler reorganizer should translate this instruction to a
move $4,$5, since a rotation by zero bits is no rotation at all.
Instead, it incorrectly generates the superfluous rotation code.

rol $4,0

This instruction does not assemble at all and generates the assembler
run-time error "(fimmed >= 0) and (fimmed <= 31)" from .Jaslemit.p,
line 588. The correct action would be to ignore this instruction. Mips
Inc. claims that this bug is fixed in a newer release of the assembler.
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rol 

$4, $5,0 

1 

1  This  instruction  does  not  assemble  at  all 

1  and  generates  the  assembler  run-time 
error  "(fimmed  >=  0)  and  (fimmed  <*  3Jj" 
from  .yaslemit.p,  line  588.  The  correct 
action  would  be  to  ignore  this  instruction. 

rol  $4,$0,$5
    subu  at,zero,a1
    srlv  at,zero,at
    sllv  a0,zero,a1
    or  a0,a0,at

This instruction should be recoded as move a0,zero, since rotating zero
by any number of bits (especially zero bits) still yields zero.

rol  $4,$0
    subu  at,zero,zero
    srlv  at,a0,at
    sllv  a0,a0,zero
    or  a0,a0,at

This instruction should be recoded as move a0,zero, since rotating zero
by any number of bits (especially zero bits) still yields zero.

rol  $4,$5,15
    sll  at,a1,15
    srl  a0,a1,17
    or  a0,a0,at

rol  $4,$5,2097153

This instruction generates the assembly error "Shift amount not 0..31".
While this is reasonable enough, the documentation maintains that shift
amounts outside the range of 0..31 are taken modulo 32 before shifting,
thus implying that this line of code would be legal.
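
To make the documented rule concrete, the following Python sketch shows what a rotate by a constant would compute if the amount were reduced modulo 32 as the documentation describes. This is an illustration under that assumption only; the assembler we tested rejects the out-of-range amount, and the function name rol_immediate is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rol_immediate(src, amount):
    """Rotate left by a constant, reducing the amount modulo 32 first."""
    n = amount % 32
    src &= MASK32
    if n == 0:
        return src                 # rotation by zero bits changes nothing
    at = (src << n) & MASK32       # sll at,src,n
    a0 = src >> (32 - n)           # srl a0,src,32-n
    return a0 | at                 # or  a0,a0,at
```

Under this reading, 2097153 reduces to a rotation of one bit, and a rotation by 15 pairs a left shift of 15 with a right shift of 17, as in the rol $4,$5,15 row.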

ror  $4,$5
    subu  at,zero,a1
    sllv  at,a0,at
    srlv  a0,a0,a1
    or  a0,a0,at

See note for rol instruction.

ror  $4,$5,$6
    subu  at,zero,a2
    sllv  at,a1,at
    srlv  a0,a1,a2
    or  a0,a0,at

ror  $4,$5,$0
    subu  at,zero,zero
    sllv  at,a1,at
    srlv  a0,a1,zero
    or  a0,a0,at

The assembler reorganizer should translate this instruction to a
move $4,$5, since a rotation by zero bits is no rotation at all.
Instead, it needlessly generates the superfluous rotation code.

ror  $4,0

This instruction does not assemble at all and generates the assembler
run-time error "(fimmed >= 0) and (fimmed <= 31)" from ./as1emit.p,
line 588. The correct action would be to ignore this instruction.

ror  $4,$5,0

This instruction does not assemble at all and generates the assembler
run-time error "(fimmed >= 0) and (fimmed <= 31)" from ./as1emit.p,
line 588. The correct action would be to ignore this instruction.

ror  $4,$0,$5
    subu  at,zero,a1
    sllv  at,zero,at
    srlv  a0,zero,a1
    or  a0,a0,at

This instruction should be recoded as move a0,zero, since rotating zero
by any number of bits (especially zero bits) still yields zero.

ror  $4,$0
    subu  at,zero,zero
    sllv  at,a0,at
    srlv  a0,a0,zero
    or  a0,a0,at

This instruction should be recoded as move a0,zero, since rotating zero
by any number of bits (especially zero bits) still yields zero.

ror  $4,$5,15
    srl  at,a1,15
    sll  a0,a1,17
    or  a0,a0,at

ror  $4,$5,2097153

This instruction generates the assembly error "Shift amount not 0..31".
While this is reasonable enough, the documentation maintains that shift
amounts outside the range of 0..31 are taken modulo 32 before shifting,
thus implying that this line of code would be legal.

seq  $4,$5
    xor  a0,a0,a1
    sltiu  a0,a0,1

The Mips M/500 native instruction set does not have an seq instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.
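
The two-instruction sequence works because the exclusive-or of two equal words is zero, and zero is the only unsigned value less than 1. A minimal Python sketch of that reasoning follows; the function names are ours and are meant only to model the xor and sltiu steps shown in the row above.

```python
MASK32 = 0xFFFFFFFF


def xor32(a, b):
    """xor: bitwise exclusive-or of two 32-bit words."""
    return (a ^ b) & MASK32


def sltiu(value, immediate):
    """sltiu: 1 if value is below the sign-extended immediate, compared unsigned."""
    return 1 if (value & MASK32) < (immediate & MASK32) else 0


def seq(a, b):
    """Set-on-equal built from xor followed by sltiu ...,1."""
    return sltiu(xor32(a, b), 1)
```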

seq  $4,$5,$6
    xor  a0,a1,a2
    sltiu  a0,a0,1

seq  $4,$5,$0
    xor  a0,a1,zero
    sltiu  a0,a0,1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

seq  $4,$5,0
    sltiu  a0,a1,1

seq  $4,$0
    xor  a0,a0,zero
    sltiu  a0,a0,1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

seq  $4,0
    sltiu  a0,a0,1

seq  $4,$0,$5
    xor  a0,zero,a1
    sltiu  a0,a0,1

seq  $4,$5,15
    xori  a0,a1,0xf
    sltiu  a0,a0,1

seq  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    xor  a0,a1,at
    sltiu  a0,a0,1

slt  $4,$5
    slt  a0,a0,a1

slt  $4,$5,$6
    slt  a0,a1,a2

slt  $4,$5,$0
    slt  a0,a1,zero

slt  $4,$5,0
    slti  a0,a1,0

slt  $4,$0
    slt  a0,a0,zero

slt  $4,0
    slti  a0,a0,0

slt  $4,$0,$5
    slt  a0,zero,a1

slt  $4,$5,15
    slti  a0,a1,15

slt  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    slt  a0,a1,at

sltu  $4,$5
    sltu  a0,a0,a1

sltu  $4,$5,$6
    sltu  a0,a1,a2

sltu  $4,$5,$0
    sltu  a0,a1,zero

sltu  $4,$5,0
    sltiu  a0,a1,0

sltu  $4,$0
    sltu  a0,a0,zero

sltu  $4,0
    sltiu  a0,a0,0

sltu  $4,$0,$5
    sltu  a0,zero,a1

sltu  $4,$5,15
    sltiu  a0,a1,15

sltu  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    sltu  a0,a1,at

sle  $4,$5
    slt  a0,a1,a0
    xori  a0,a0,0x1

The Mips M/500 native instruction set does not have an sle instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode. We would like to point out that other
architectures usually require several instructions to set condition
codes and test them. The scheme that Mips uses is actually better, in
spite of the occasional code expansion.
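
The sle expansion relies on the identity that a <= b holds exactly when b < a does not. The Python sketch below models the slt and xori steps under that reading; the helper names are hypothetical and the model covers only 32-bit two's-complement words.

```python
MASK32 = 0xFFFFFFFF


def to_signed(word):
    """Interpret a 32-bit word as a two's-complement signed integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def slt(rs, rt):
    """slt: 1 if rs < rt as signed values, else 0."""
    return 1 if to_signed(rs) < to_signed(rt) else 0


def sle(a, b):
    """Set-on-less-or-equal: slt with reversed operands, then invert bit 0."""
    a0 = slt(b, a)     # slt  a0,b,a
    return a0 ^ 1      # xori a0,a0,0x1
```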

sle  $4,$5,$6
    slt  a0,a2,a1
    xori  a0,a0,0x1

sle  $4,$5,$0
    slt  a0,zero,a1
    xori  a0,a0,0x1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

sle  $4,$5,0
    slti  a0,a1,1

sle  $4,$0
    slt  a0,zero,a0
    xori  a0,a0,0x1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

sle  $4,0
    slti  a0,a0,1

sle  $4,$0,$5
    slt  a0,a1,zero
    xori  a0,a0,0x1
CMU/SEI-87-TR-29 




Assembler  Input  Machine  Language  Output  Comments 


sle  $4,$5,15
    slti  a0,a1,16

sle  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x2
    slt  a0,a1,at

sleu  $4,$5
    sltu  a0,a1,a0
    xori  a0,a0,0x1

The Mips M/500 native instruction set does not have an sleu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

sleu  $4,$5,$6
    sltu  a0,a2,a1
    xori  a0,a0,0x1

sleu  $4,$5,$0
    sltu  a0,zero,a1
    xori  a0,a0,0x1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

sleu  $4,$5,0
    sltiu  a0,a1,1

sleu  $4,$0
    sltu  a0,zero,a0
    xori  a0,a0,0x1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero.

sleu  $4,0
    sltiu  a0,a0,1

sleu  $4,$0,$5
    sltu  a0,a1,zero
    xori  a0,a0,0x1

sleu  $4,$5,15
    sltiu  a0,a1,16

sleu  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x2
    sltu  a0,a1,at

sgt  $4,$5
    slt  a0,a1,a0

The Mips M/500 native instruction set does not have an sgt instruction,
so it is faked with an slt instruction with reversed operands at no
extra cost.
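
Reversing the operands costs nothing because a > b and b < a are the same test. The short Python model below illustrates this for the signed form and for the unsigned form that the sgtu rows use; the function names are ours, chosen to mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def to_signed(word):
    """Interpret a 32-bit word as a two's-complement signed integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def slt(rs, rt):
    """slt: signed less-than."""
    return 1 if to_signed(rs) < to_signed(rt) else 0


def sltu(rs, rt):
    """sltu: unsigned less-than."""
    return 1 if (rs & MASK32) < (rt & MASK32) else 0


def sgt(a, b):
    """Set-on-greater-than as a single slt with swapped operands."""
    return slt(b, a)


def sgtu(a, b):
    """Unsigned set-on-greater-than as a single sltu with swapped operands."""
    return sltu(b, a)
```

The word 0xFFFFFFFF shows why both forms are needed: it is -1 in the signed comparison but the largest value in the unsigned one.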

sgt  $4,$5,$6
    slt  a0,a2,a1

sgt  $4,$5,$0
    slt  a0,zero,a1

sgt  $4,$5,0
    slt  a0,zero,a1

sgt  $4,$0
    slt  a0,zero,a0

sgt  $4,0
    slt  a0,zero,a0

sgt  $4,$0,$5
    slt  a0,a1,zero

sgt  $4,$5,15
    li  at,15
    slt  a0,at,a1

sgt  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    slt  a0,at,a1

sgtu  $4,$5
    sltu  a0,a1,a0

The Mips M/500 native instruction set does not have an sgtu instruction,
so it is faked with an sltu instruction with reversed operands at no
extra cost.

sgtu  $4,$5,$6
    sltu  a0,a2,a1

sgtu  $4,$5,$0
    sltu  a0,zero,a1

sgtu  $4,$5,0
    sltu  a0,zero,a1

sgtu  $4,$0
    sltu  a0,zero,a0

sgtu  $4,0
    sltu  a0,zero,a0

sgtu  $4,$0,$5
    sltu  a0,a1,zero

sgtu  $4,$5,15
    li  at,15
    sltu  a0,at,a1

sgtu  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    sltu  a0,at,a1

sge  $4,$5
    slt  a0,a0,a1
    xori  a0,a0,0x1

The Mips M/500 native instruction set does not have an sge instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

sge  $4,$5,$6
    slt  a0,a1,a2
    xori  a0,a0,0x1

sge  $4,$5,$0
    slt  a0,a1,zero
    xori  a0,a0,0x1

sge  $4,$5,0
    slti  a0,a1,0
    xori  a0,a0,0x1

sge  $4,$0
    slt  a0,a0,zero
    xori  a0,a0,0x1

sge  $4,0
    slti  a0,a0,0
    xori  a0,a0,0x1

sge  $4,$0,$5
    slt  a0,zero,a1
    xori  a0,a0,0x1

sge  $4,$5,15
    slti  a0,a1,15
    xori  a0,a0,0x1

sge  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    slt  a0,a1,at
    xori  a0,a0,0x1

sgeu  $4,$5
    sltu  a0,a0,a1
    xori  a0,a0,0x1

The Mips M/500 native instruction set does not have an sgeu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

sgeu  $4,$5,$6
    sltu  a0,a1,a2
    xori  a0,a0,0x1

sgeu  $4,$5,$0
    sltu  a0,a1,zero
    xori  a0,a0,0x1

All unsigned numbers are greater than or equal to 0, so this instruction
should simply expand to xori a0,a0,0x1. Instead, it is expanded to a
code sequence that, while functionally correct, takes twice as long to
execute.

sgeu  $4,$5,0
    sltiu  a0,a1,0
    xori  a0,a0,0x1

Suboptimal code (see above).

sgeu  $4,$0
    sltu  a0,a0,zero
    xori  a0,a0,0x1

Suboptimal code (see above).

sgeu  $4,0
    sltiu  a0,a0,0
    xori  a0,a0,0x1

Suboptimal code (see above).

sgeu  $4,$0,$5
    sltu  a0,zero,a1
    xori  a0,a0,0x1

Suboptimal code (see above).

sgeu  $4,$5,15
    sltiu  a0,a1,15
    xori  a0,a0,0x1

sgeu  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    sltu  a0,a1,at
    xori  a0,a0,0x1

sne  $4,$5
    xor  a0,a0,a1
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Not only does the Mips M/500 native instruction set lack an sne
instruction (so that it is faked with three other instructions,
effectively tripling the execution time of this opcode), but the
assembler reorganizer also generates a suboptimal code sequence. What
should be generated is xor a0,a0,a1 followed by sltu a0,zero,a0, which
takes only two cycles to execute.
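
That the shorter sequence is a valid replacement can be checked directly. The Python sketch below models both the generated three-instruction sequence and the two-instruction sequence we propose; the function names sne_three and sne_two are ours and exist only for this comparison.

```python
MASK32 = 0xFFFFFFFF


def sne_three(a, b):
    """The generated sequence: xor, sltiu ...,1, xori ...,0x1."""
    a0 = (a ^ b) & MASK32        # xor   a0,a,b
    a0 = 1 if a0 < 1 else 0      # sltiu a0,a0,1
    return a0 ^ 1                # xori  a0,a0,0x1


def sne_two(a, b):
    """The proposed sequence: xor, then sltu a0,zero,a0."""
    a0 = (a ^ b) & MASK32        # xor  a0,a,b
    return 1 if 0 < a0 else 0    # sltu a0,zero,a0
```

Both functions return 1 exactly when the two words differ, since any nonzero exclusive-or is greater than zero in an unsigned comparison.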

sne  $4,$5,$6
    xor  a0,a1,a2
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,$5,$0
    xor  a0,a1,zero
    sltiu  a0,a0,1
    xori  a0,a0,0x1

The assembler reorganizer once again misses the fact that the zero
register is functionally equivalent to the constant value zero. It is
also generating suboptimal code (see above).

sne  $4,$5,0
    sltiu  a0,a1,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,$0
    xor  a0,a0,zero
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,0
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,$0,$5
    xor  a0,zero,a1
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,$5,15
    xori  a0,a1,0xf
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sne  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    xor  a0,a1,at
    sltiu  a0,a0,1
    xori  a0,a0,0x1

Suboptimal code (see above).

sll  $4,$5
    sllv  a0,a0,a1

sll  $4,$5,$6
    sllv  a0,a1,a2

sll  $4,$5,$0
    sllv  a0,a1,zero

This instruction could be substituted with a simple move instruction,
since a shift of zero bits is no shift at all. However, since both
instructions take one cycle, there is no extra incurred expense.

sll  $4,$5,0
    sll  a0,a1,0

sll  $4,$0
    sllv  a0,a0,zero

sll  $4,0
    sll  a0,a0,0

sll  $4,$0,$5
    sllv  a0,zero,a1

This instruction should be recoded as move a0,zero, since shifting zero
by any number of bits (especially zero) still yields zero.

sll  $4,$5,15
    sll  a0,a1,15

sll  $4,$5,2097153

This instruction generates the assembly error "Shift amount not 0..31".
While this is reasonable enough, the documentation maintains that shift
amounts outside of the range of 0..31 are taken modulo 32 before
shifting, thus implying that this line of code would be legal.

sra  $4,$5
    srav  a0,a0,a1

sra  $4,$5,$6
    srav  a0,a1,a2

sra  $4,$5,$0
    srav  a0,a1,zero

This instruction could be substituted with a simple move instruction,
since a shift of zero bits is no shift at all. However, since both
instructions take one cycle, there is no extra incurred expense.

sra  $4,$5,0
    sra  a0,a1,0

sra  $4,$0
    srav  a0,a0,zero

sra  $4,0
    sra  a0,a0,0

sra  $4,$0,$5
    srav  a0,zero,a1

This instruction should be recoded as move a0,zero, since shifting zero
by any number of bits (especially zero) still yields zero.

sra  $4,$5,15
    sra  a0,a1,15

sra  $4,$5,2097153

This instruction generates the assembly error "Shift amount not 0..31".
While this is reasonable enough, the documentation maintains that shift
amounts outside of the range of 0..31 are taken modulo 32 before
shifting, thus implying that this line of code would be legal.

srl  $4,$5
    srlv  a0,a0,a1

srl  $4,$5,$6
    srlv  a0,a1,a2

srl  $4,$5,$0
    srlv  a0,a1,zero

This instruction could be substituted with a simple move instruction,
since a shift of zero bits is no shift at all. However, since both
instructions take one cycle, there is no extra incurred expense.

srl  $4,$5,0
    srl  a0,a1,0

srl  $4,$0
    srlv  a0,a0,zero

srl  $4,0
    srl  a0,a0,0

srl  $4,$0,$5
    srlv  a0,zero,a1

This instruction should be recoded as move a0,zero, since shifting zero
by any number of bits (especially zero) still yields zero.

srl  $4,$5,15
    srl  a0,a1,15

srl  $4,$5,2097153

This instruction generates the assembly error "Shift amount not 0..31".
While this is reasonable enough, the documentation maintains that shift
amounts outside of the range of 0..31 are taken modulo 32 before
shifting, thus implying that this line of code would be legal.

sub  $4,$5
    sub  a0,a0,a1

sub  $4,$5,$6
    sub  a0,a1,a2

sub  $4,$5,$0
    sub  a0,a1,zero

This instruction could be substituted with a simple move, since
subtracting zero from a number gives that number as a result. However,
since both instructions take one cycle, there is no extra incurred
expense.

sub  $4,$5,0
    addi  a0,a1,0

This instruction could be substituted with a simple move, since
subtracting zero from a number gives that number as a result. However,
since both instructions take one cycle, there is no extra incurred
expense.

sub  $4,$0
    sub  a0,a0,zero

When the minuend and the destination of the subtraction are the same
register and the subtrahend is zero, the assembler reorganizer should
remove the instruction. As can be seen, it does not.

sub  $4,0
    addi  a0,a0,0

When the minuend and the destination of the subtraction are the same
register and the subtrahend is zero, the assembler reorganizer should
remove the instruction. As can be seen, it does not.

sub  $4,$0,$5
    sub  a0,zero,a1

sub  $4,$5,32767
    addi  a0,a1,-32767

Subtraction of constant values is implemented as the addition of their
negative value.

sub  $4,$5,32768
    li  at,32768
    sub  a0,a1,at

Unfortunately, the assembler reorganizer is not smart enough to
recognize that -32768 would be only 16 bits long. What should be
generated here is addi a0,a1,-32768.

sub  $4,$5,-32767
    addi  a0,a1,32767

The negatives of the values are used for both positive and negative
constants.

sub  $4,$5,-32768
    li  at,-32768
    sub  a0,a1,at

The assembler reorganizer is smart enough to know that 32768 is too big
for an immediate operand, though.

sub  $4,$5,15
    addi  a0,a1,-15

The use of the addi instruction instead of the anticipated subi, while
entirely legal, suggests a lack of orthogonality of the Mips M/500
native instruction set. In this case, this is perfectly reasonable
(since the Mips M/500 native architecture is RISC in nature).

sub  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    sub  a0,a1,at

subu  $4,$5
    subu  a0,a0,a1

subu  $4,$5,$6
    subu  a0,a1,a2

subu  $4,$5,$0
    subu  a0,a1,zero

subu  $4,$5,0
    addiu  a0,a1,0

subu  $4,$0
    subu  a0,a0,zero

This instruction does nothing and should be elided by the assembler
reorganizer.
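
The treatment of constant subtrahends in the sub rows (32767, 32768, -32767, and -32768) turns on whether the negated constant fits in a signed 16-bit immediate field. The following Python sketch shows the selection rule we would expect; it is a hypothetical rule written for illustration, not the assembler's actual logic, and the instruction strings simply follow the table's notation with li left as an assembler macro.

```python
def fits_simm16(value):
    """True if value is representable as a signed 16-bit immediate."""
    return -32768 <= value <= 32767


def expand_sub_immediate(constant):
    """Expand 'sub a0,a1,constant', preferring addi with the negated constant."""
    negated = -constant
    if fits_simm16(negated):
        return [f"addi a0,a1,{negated}"]
    # Otherwise load the constant into the temporary register first.
    return [f"li at,{constant}", "sub a0,a1,at"]
```

With this rule, subtracting 32768 becomes a single addi of -32768 (the improvement suggested in the comments), while subtracting -32768 still needs the temporary register because +32768 does not fit.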

subu  $4,0
    addiu  a0,a0,0

This instruction does nothing and should be elided by the assembler
reorganizer.

subu  $4,$0,$5
    subu  a0,zero,a1

subu  $4,$5,15
    addiu  a0,a1,-15

subu  $4,$5,2097153
    lui  at,0x20
    ori  at,at,0x1
    subu  a0,a1,at

move  $4,$5
    move  a0,a1

mult  $4,$5
    mult  a0,a1

mult  $4,$0
    mult  a0,zero

This instruction could be replaced with move a0,zero, but since other
instructions may be counting on the contents of the hi and lo registers
afterward, this cannot be done. (The mult instruction is documented as
leaving the results of the multiplication in these registers.)
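
The reason the hi and lo registers matter is that the 64-bit product is split across them. The Python sketch below models a signed multiply in that way; it is an illustrative model with names of our choosing, not a description of the hardware's timing or interlocks.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def to_signed(word):
    """Interpret a 32-bit word as a two's-complement signed integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def mult(rs, rt):
    """Signed 32x32 multiply; returns the (hi, lo) halves of the 64-bit product."""
    product = (to_signed(rs) * to_signed(rt)) & MASK64
    return (product >> 32) & MASK32, product & MASK32
```

Multiplying by the zero register clears both halves, so later code that reads hi or lo would see zero; replacing the multiply with a move would leave stale values there instead.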

multu  $4,$5
    multu  a0,a1

multu  $4,$0
    multu  a0,zero

This instruction could be replaced with move a0,zero, but since other
instructions may be counting on the contents of the hi and lo registers
afterward, this cannot be done. (The multu instruction is documented as
leaving the results of the multiplication in these registers.)

b  TOP
    b  0
    nop

The trailing nop instructions that follow each of these condition tests
may be filled with an instruction that the assembler reorganizer can
move downward.

beq  $4,$5,TOP
    beq  a0,a1,0
    nop

beq  $4,0,TOP
    beq  a0,zero,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

beq  $4,$0,TOP
    beq  a0,zero,0
    nop

beq  $4,15,TOP
    li  at,15
    beq  a0,at,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.
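
How a constant reaches the temporary register depends on its size: small values need one instruction, while larger ones are built from an upper and a lower 16-bit half. The Python sketch below reproduces the two patterns that appear in this table (li for 15, lui followed by ori for 2097153). The function name and the selection rule are our assumptions, and li is left as the assembler macro rather than expanded further.

```python
def materialize(constant):
    """Return the instructions that load a constant into the at register."""
    if -32768 <= constant <= 32767:
        return [f"li at,{constant}"]
    word = constant & 0xFFFFFFFF
    upper, lower = word >> 16, word & 0xFFFF
    code = [f"lui at,{hex(upper)}"]             # set the upper 16 bits
    if lower:
        code.append(f"ori at,at,{hex(lower)}")  # merge in the lower 16 bits
    return code
```

For 2097153 the halves are 0x20 and 0x1, and (0x20 << 16) | 0x1 gives back the original constant.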

beq  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    beq  a0,at,0
    nop

bgt  $4,$5,TOP
    slt  at,a1,a0
    bne  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bgt instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bgt  $4,0,TOP
    bgtz  a0,0
    nop

In this case, the use of the Mips M/500 native bgtz instruction keeps
the effective execution time in line with the anticipated time.

bgt  $4,$0,TOP
    bgtz  a0,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bgt  $4,15,TOP
    slti  at,a0,16
    beq  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bgt  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x2
    slt  at,a0,at
    beq  at,zero,0
    nop

bge  $4,$5,TOP
    slt  at,a0,a1
    beq  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bge instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bge  $4,0,TOP
    bgez  a0,0
    nop

In this case, the use of the Mips M/500 native bgez instruction keeps
the effective execution time in line with the anticipated time.

bge  $4,$0,TOP
    bgez  a0,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bge  $4,15,TOP
    slti  at,a0,15
    beq  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bge  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    slt  at,a0,at
    beq  at,zero,0
    nop

bgeu  $4,$5,TOP
    sltu  at,a0,a1
    beq  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bgeu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bgeu  $4,0,TOP
    b  0
    nop

All numbers are greater than or equal to zero in unsigned comparisons,
so the assembler reorganizer has correctly translated the conditional
branch into an unconditional branch instruction.

bgeu  $4,$0,TOP
    b  0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bgeu  $4,15,TOP
    sltiu  at,a0,15
    beq  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bgeu  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    sltu  at,a0,at
    beq  at,zero,0
    nop

bgtu  $4,$5,TOP
    sltu  at,a1,a0
    bne  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bgtu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bgtu  $4,0,TOP
    bne  a0,zero,0
    nop

In unsigned comparisons, all numbers are either greater than or equal
to zero. Since we are concerned with numbers that are greater than
zero, the assembler reorganizer tests for not equal to zero, which
suffices.

bgtu  $4,$0,TOP
    bne  a0,zero,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bgtu  $4,15,TOP
    sltiu  at,a0,16
    beq  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bgtu  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x2
    sltu  at,a0,at
    beq  at,zero,0
    nop

blt  $4,$5,TOP
    slt  at,a0,a1
    bne  at,zero,0
    nop

The Mips M/500 native instruction set does not have a blt instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

blt  $4,0,TOP
    bltz  a0,0
    nop

In this case, the use of the Mips M/500 native bltz instruction keeps
the effective execution time in line with the anticipated time.

blt  $4,$0,TOP
    bltz  a0,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

blt  $4,15,TOP
    slti  at,a0,15
    bne  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

blt  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    slt  at,a0,at
    bne  at,zero,0
    nop

ble  $4,$5,TOP
    slt  at,a1,a0
    beq  at,zero,0
    nop

The Mips M/500 native instruction set does not have a ble instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

ble  $4,0,TOP
    blez  a0,0
    nop

In this case, the use of the Mips M/500 native blez instruction keeps
the effective execution time in line with the anticipated time.

ble  $4,$0,TOP
    blez  a0,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

ble  $4,15,TOP
    slti  at,a0,16
    bne  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

ble  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x2
    slt  at,a0,at
    bne  at,zero,0
    nop

bleu  $4,$5,TOP
    sltu  at,a1,a0
    beq  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bleu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bleu  $4,0,TOP
    beq  a0,zero,0
    nop

In unsigned comparisons, all numbers are either greater than or equal
to zero. Since we are concerned with numbers that are less than or
equal to zero, the assembler reorganizer tests for equal to zero, which
suffices.
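
The unsigned branches that compare against zero (bgeu, bgtu, bleu, and bltu) each collapse to something simpler, because no unsigned word is below zero. The Python sketch below tabulates the full test next to the simplified one and checks that they agree; the dictionary and function names are ours, introduced only for this illustration.

```python
MASK32 = 0xFFFFFFFF

# Each unsigned branch against zero, written as the full comparison.
FULL_TEST = {
    "bgeu": lambda word: word >= 0,
    "bgtu": lambda word: word > 0,
    "bleu": lambda word: word <= 0,
    "bltu": lambda word: word < 0,
}

# What each branch reduces to when the second operand is zero.
SIMPLIFIED = {
    "bgeu": lambda word: True,        # unconditional branch
    "bgtu": lambda word: word != 0,   # bne a0,zero,TOP
    "bleu": lambda word: word == 0,   # beq a0,zero,TOP
    "bltu": lambda word: False,       # never taken, so no branch is needed
}


def simplification_holds(mnemonic, samples):
    """Check that the simplified test matches the full test on every sample."""
    return all(
        FULL_TEST[mnemonic](w & MASK32) == SIMPLIFIED[mnemonic](w & MASK32)
        for w in samples
    )
```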

bleu  $4,$0,TOP
    beq  a0,zero,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bleu  $4,15,TOP
    sltiu  at,a0,16
    bne  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.
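
In the bleu $4,15,TOP row the constant that appears in the generated code is 16, not 15, because a0 <= 15 is tested as a0 < 16. The Python sketch below models that rewrite; the names are hypothetical, and the sketch assumes the incremented constant still fits in the 16-bit immediate field.

```python
MASK32 = 0xFFFFFFFF


def sltiu(rs, immediate):
    """sltiu: 1 if rs is below the sign-extended immediate, compared unsigned."""
    return 1 if (rs & MASK32) < (immediate & MASK32) else 0


def bleu_immediate_taken(a0, constant):
    """Model 'bleu a0,constant,TOP' as sltiu at,a0,constant+1 then bne at,zero,TOP."""
    at = sltiu(a0, constant + 1)
    return at != 0
```

The signed ble and bgt rows use the same adjustment with slti, with bgt branching on the opposite outcome.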

bleu  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x2
    sltu  at,a0,at
    bne  at,zero,0
    nop

bltu  $4,$5,TOP
    sltu  at,a0,a1
    bne  at,zero,0
    nop

The Mips M/500 native instruction set does not have a bltu instruction,
so it is faked with two other instructions, effectively doubling the
execution time of this opcode.

bltu  $4,0,TOP

This instruction generates no code at all. This is correct behavior,
since no number may be less than 0 in an unsigned comparison, so the
branch can never be taken. If the branch instruction is addressed by a
label, and hence possibly the target of a branch, the assembler
reorganizer substitutes a nop instruction for the bltu.

bltu  $4,$0,TOP

This instruction generates no code at all. This is correct behavior,
since no number may be less than 0 in an unsigned comparison, so the
branch can never be taken. If the branch instruction is addressed by a
label, and hence possibly the target of a branch, the assembler
reorganizer substitutes a nop instruction for the bltu. In this case
also, the assembler reorganizer correctly treats the zero register and
the constant value 0 as identical.

bltu  $4,15,TOP
    sltiu  at,a0,15
    bne  at,zero,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bltu  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    sltu  at,a0,at
    bne  at,zero,0
    nop

bne  $4,$5,TOP
    bne  a0,a1,0
    nop

bne  $4,0,TOP
    bne  a0,zero,0
    nop

bne  $4,$0,TOP
    bne  a0,zero,0
    nop

In this case, the assembler reorganizer correctly treats the zero
register and the constant value 0 as identical.

bne  $4,15,TOP
    li  at,15
    bne  a0,at,0
    nop

None of the conditional branches supports an immediate operand, so the
assembler reorganizer loads the immediate operand into the temporary
register at.

bne  $4,2097153,TOP
    lui  at,0x20
    ori  at,at,0x1
    bne  a0,at,0
    nop

bal  TOP
    bgezal  zero,0
    nop

Apparently, there is no unconditional branch and link instruction in
the Mips M/500 native instruction set, so the assembler reorganizer
substitutes the conditional bgezal instruction with an always TRUE
condition.

bltzal  TOP

According to the documentation, this instruction is legal, but when
assembled, it generates the error "Register expected: TOP". It would
seem that neither the bltzal nor the bgezal instruction functions at
all.

bgezal  $4

According to the documentation, this instruction is legal, but when
assembled, it generates the error "label expected". It would seem that
neither the bltzal nor the bgezal instruction functions at all.

beqz  $4,TOP
    beq  a0,zero,0
    nop

bgez  $4,TOP
    bgez  a0,0
    nop

bgtz  $4,TOP
    bgtz  a0,0
    nop

blez  $4,TOP
    blez  a0,0
    nop

bltz  $4,TOP
    bltz  a0,0
    nop

bnez  $4,TOP
    bne  a0,zero,0
    nop

j  TOP
    j  0
    nop

j  $4
    jr  a0
    nop

jal  TOP
    jal  0
    nop

lwc1  $f4,ADDR

lwc2  $4,ADDR

lwc3  $4,ADDR

swc0  $4,ADDR

swc1  $f4,ADDR

swc2  $4,ADDR

swc3  $4,ADDR

mfc0  $4,$5

mfc1  $4,$f5

mfc1.d  $4,$f6

mfc2  $4,$5

mfc3  $4,$5


at,0
a0,at,7592

at,0
a0,at,7592

a0,c0r5

a0,f5

a1,f6
a0,f7


Note  that  cOr5  refers  to  coprocessor  0 
register  5. 


This  instruction  is  undocumented  in  the 
Mips  Assembly  Language  Programmers 
Guide.  It  serves  to  store  a  double¬ 
precision  floating-point  number  from  the 
floating-point  co-processor  by  performing 
two  single-word  store  irwtructions. 


c2  aO, sero, 10240 

nop 


c3  aO, zero, 10240 
mtc0  $4,$5

mtc0  a0,c0r5

nop

mtc1  $4,$f5

mtc1  a0,f5

nop

mtc1.d  $4,$f6

mtc1  a1,f6
mtc1  a0,f7
nop 

This instruction is undocumented in the
Mips Assembly Language Programmers
Guide, but is generated by the compilers.
It serves to load a double-precision
floating-point number into the
floating-point co-processor by performing two
single-word load instructions.

mtc2  $4,$5

c2  a0,a0,10240

nop

mtc3  $4,$5

c3  a0,a0,10240

nop 

bc0f  TOP

bc0f  0

nop

bc1f  TOP

bc1f  0

nop

bc2f  TOP

c2  zero,t0,0

nop

bc3f  TOP

c3  zero,t0,0

nop

bc0t  TOP

bc0t  0

nop

bc1t  TOP

bc1t  0

nop

bc2t  TOP

c2  at,t0,0

nop

bc3t  TOP

c3  at,t0,0

nop 

c0  15

c0  c0op15

c1  15

fop0f.s  f0,f0,f0

The disassembler supplied by Mips (and
used to extract the machine-language
output) "knows" that co-processor 1 is
the floating-point unit, so it interprets c1
as a floating-point instruction. We are
not sure exactly what this instruction is,
though.

c2  15

c2  zero,s0,15

c3  15

c3  zero,s0,15

cfc0  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.
cfc1  $4,$5

cfc1  a0,f5

nop

cfc2  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.

cfc3  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.

ctc0  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.

ctc1  $4,$5

ctc1  a0,f5

nop

ctc2  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.

ctc3  $4,$5

This instruction, although documented, is
not recognized by the assembler
reorganizer as being legal.

tlbp

c0  tlbp

tlbr

c0  tlbr

tlbwr

c0  tlbwr

tlbwi

c0  tlbwi

nop

nop

Although undocumented, this
instruction's function should be obvious.

l.s  $f2,TOP

lui  at,0

lwc1  f2,0(at)

In this and all floating-point load/store
operations, the instructions that are
generated use the lwc1 and swc1
instructions. These instructions use a general
address expression for their second
operand. Therefore, the assembler
reorganizer must generate a load instruction
for the at register, even if the resultant
effective address will be a simple
constant value.

l.d  $f2,TOP

lui  at,0

lwc1  f2,4(at)

lwc1  f3,0(at)

nop

Loading a double-precision number
requires two lwc1 instructions to load all
64 bits.

s.s  $f2,TOP

lui  at,0

swc1  f2,0(at)

s.d  $f2,TOP

lui  at,0

swc1  f3,0(at)

swc1  f2,4(at)

Storing a double-precision number
requires two swc1 instructions to store all
64 bits.

abs.s  $f2,$f4

abs.s  f2,f4
abs.d  $f2,$f4

neg.s  $f2,$f4

add.s  $f2,$f4,$f6

add.d  $f2,$f4,$f6

sub.s  $f2,$f4,$f6

sub.d  $f2,$f4,$f6

mul.s  $f2,$f4,$f6

mul.d  $f2,$f4,$f6

div.s  $f2,$f4,$f6

div.d  $f2,$f4,$f6

cvt.s.d  $f2,$f4

cvt.d.s  $f2,$f4

cvt.w.d  $f2,$f4

cvt.d.w  $f2,$f4

cvt.s.w  $f2,$f4

cvt.w.s  $f2,$f4

trunc.w.s  $f2,$f4,$4

Machine  Language  Output 


Comments 


abs.d  f2,f4

neg.s  f2,f4,f0

neg.d  f2,f4,f0

add.s  f2,f4,f6

add.d  f2,f4,f6

sub.s  f2,f4,f6

sub.d  f2,f4,f6

mul.s  f2,f4,f6

mul.d  f2,f4,f6

div.s  f2,f4,f6

div.d  f2,f4,f6

cvt.s.d  f2,f4

cvt.d.s  f2,f4

cvt.w.d  f2,f4

cvt.d.w  f2,f4

cvt.s.w  f2,f4

cvt.w.s  f2,f4


a0,f31

a0,f31

at,a0,0x3
at,at,0x2
at,f31


ctc1

nop

cvt.w.s

ctc1

nop

nop

nop


nop

cvt.w.d

ctc1

nop

nop

nop


f2,f4
a0,f31


a0,f31

a0,f31

at,a0,0x3
at,at,0x2
at,f31

f2,f4
a0,f31


The  neg  instruction  appears  to  need  an 
extra  register. 


The  neg  instruction  appears  to  need  an 
extra  register. 


Truncation appears to be a rather
expensive operation (although the
documentation does describe these instructions as
being "macro" instructions).


Truncation appears to be a rather
expensive operation (although the
documentation does describe these instructions as
being "macro" instructions).
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Machine  Language  Output 

Comments 

round.w.s

cfc1  a0,f31
cfc1  a0,f31
li  at,-4
and  at,at,a0
ctc1  at,f31
nop
cvt.w.s  f2,f4
ctc1  a0,f31
nop
nop
nop

Rounding appears to be a rather
expensive operation (although the
documentation does describe these instructions as
being "macro" instructions).

round.w.d

cfc1  a0,f31
cfc1  a0,f31
li  at,-4
and  at,at,a0
ctc1  at,f31
nop
cvt.w.d  f2,f4
ctc1  a0,f31
nop
nop
nop

Rounding appears to be a rather
expensive operation (although the
documentation does describe these instructions as
being "macro" instructions).

c.f.s  $f2,$f4

c.f.s  f2,f4
nop

The trailing nop instructions that follow
each of these condition tests may be
filled with an instruction that the
assembler reorganizer can move
downward. Note that the Mips M/500
native instruction set does not have any
floating-point conditional branches per
se, but instead uses the bc1t and bc1f
instructions (page 191) to branch on the
condition codes set by these relational
operations.

c.f.d  $f2,$f4

c.f.d  f2,f4
nop

c.un.s  $f2,$f4

c.un.s  f2,f4
nop

c.un.d  $f2,$f4

c.un.d  f2,f4
nop

c.eq.s  $f2,$f4

c.eq.s  f2,f4
nop

c.eq.d  $f2,$f4

c.eq.d  f2,f4
nop

c.ueq.s  $f2,$f4

c.ueq.s  f2,f4
nop

c.ueq.d  $f2,$f4

c.ueq.d  f2,f4
nop

c.olt.s  $f2,$f4

c.olt.s  f2,f4
nop

194 


CMU/SEI-87-TR-25 


Assembler  Input 


c.olt.d  $f2,$f4

c.ult.s  $f2,$f4

c.ult.d  $f2,$f4

c.ole.s  $f2,$f4

c.ole.d  $f2,$f4

c.ule.s  $f2,$f4

c.ule.d  $f2,$f4

c.sf.s  $f2,$f4

c.sf.d  $f2,$f4

c.ngle.s  $f2,$f4

c.ngle.d  $f2,$f4

c.seq.s  $f2,$f4

c.seq.d  $f2,$f4

c.ngl.s  $f2,$f4

c.ngl.d  $f2,$f4

c.lt.s  $f2,$f4

c.lt.d  $f2,$f4

c.nge.s  $f2,$f4

c.nge.d  $f2,$f4

c.le.s  $f2,$f4

c.le.d  $f2,$f4


Machine  Language  Output 


c.olt.d  f2,f4
nop

c.ult.s  f2,f4
nop

c.ult.d  f2,f4
nop

c.ole.s  f2,f4
nop

c.ole.d  f2,f4
nop

c.ule.s  f2,f4
nop

c.ule.d  f2,f4
nop

c.sf.s  f2,f4
nop

c.sf.d  f2,f4
nop

c.ngle.s  f2,f4
nop

c.ngle.d  f2,f4
nop

c.seq.s  f2,f4
nop

c.seq.d  f2,f4
nop

c.ngl.s  f2,f4
nop

c.ngl.d  f2,f4
nop

c.lt.s  f2,f4
nop

c.lt.d  f2,f4
nop

c.nge.s  f2,f4
nop

c.nge.d  f2,f4
nop

c.le.s  f2,f4
nop

c.le.d  f2,f4
nop

Comments 


Assembler  Input 

Machine  Language  Output 

Comments 

c.ngt.s  $f2,$f4

c.ngt.s  f2,f4
nop

c.ngt.d  $f2,$f4

c.ngt.d  f2,f4
nop

mov.s  $f2,$f4

mov.s  f2,f4

mov.d  $f2,$f4

mov.d  f2,f4

nop

Table A-2 (on the following page) provides an alphabetic cross reference of Mips assembler instructions.
The previous table was listed in the instruction order presented in Chapter 5 of the Mips
Assembly Language Reference Manual [WPS 86a]. Table A-2 is supplied to provide an easy mechanism
for locating the page number on which instructions are first referenced.
Instruction   Page

abs           159
abs.d         193
abs.s         192
add           160
add.d         193
add.s         193
addu          160
and           161
b             184
bal           189
bc0f          191
bc0t          191
bc1f          191
bc1t          191
bc2f          191
bc2t          191
bc3f          191
bc3t          191
beq           164
beqz          189
bge           185
bgeu          185
bgez          169
bgezal        189
bgt           185
bgtu          186
bgtz          189
ble           187
bleu          187
blez          169
blt           166
bltu          168
bltz          189
bltzal        189
bne           188
bnez          189
break         190
c.eq.d        194
c.eq.s        194
c.f.d         194
c.f.s         194
c.le.d        195
c.le.s        195
c.lt.d        195
c.lt.s        195
c.nge.d       195
c.nge.s       195
c.ngl.d       195
c.ngl.s       195
c.ngle.d      195
c.ngle.s      195
c.ngt.d       196
c.ngt.s       196
c.ole.d       195
c.ole.s       195
c.olt.d       195
c.olt.s       194
c.seq.d       195
c.seq.s       195
c.sf.d        195
c.sf.s        195
c.ueq.d       194
c.ueq.s       194
c.ule.d       195
c.ule.s       195
c.ult.d       195
c.ult.s       195
c.un.d        194
c.un.s        194
c0            191
c1            191
c2            191
c3            191
cfc0          191
cfc1          192
cfc2          192
cfc3          192
ctc0          192
ctc1          192
ctc2          192
ctc3          192
cvt.d.s       193
cvt.d.w       193
cvt.s.d       193
cvt.s.w       193
cvt.w.d       193
cvt.w.s       193
div           162
div.d         193
div.s         193
divu          164
j             189
jal           189
l.d           192
l.s           192
la            146
lb            147
lbu           147
ld            150
lh            148
lhu           148
li            154
lui           154
lw            148
lwc0          190
lwc1          190
lwc2          190
lwc3          190
lwl           149
lwr           149
mfc0          190
mfc1          190
mfc1.d        190
mfc2          190
mfc3          190
mfhi          190
mflo          190
mov.s         196
mov.d         196
move          184
mtc0          191
mtc1          191
mtc1.d        191
mtc2          191
mtc3          191
mthi          190
mtlo          190
mul           166
mul.d         193
mul.s         193
mulo          167
mulou         168
mult          184
multu         184
neg           159
neg.d         193
neg.s         193
negu          159
nop           192
nor           169
not           160
or            170
rem           171
remu          173
(illegible)   190
rol           174
ror           175
round.w.d     194
round.w.s     194
s.d           192
s.s           192
sb            154
sd            154
seq           176
sge           179
sgeu          179
sgt           178
sgtu          179
sh            155
sle           177
sleu          178
sll           181
slt           176
sltu          177
sne           180
sra           181
srl           (illegible)
sub           182
sub.d         193
sub.s         193
subu          183
sw            156
swc0          190
swc1          190
swc2          190
swc3          190
swl           155
swr           156
syscall       190
tlbp          192
tlbr          192
tlbwi         192
tlbwr         192
trunc.w.d     193
trunc.w.s     193
ulh           151
ulhu          152
ulw           153
ush           157
usw           158
xor           165


Table A-2: Alphabetic Cross Reference of Mips Assembler Instructions
Tables A-3 and A-4 list the actual hardware instructions supported by the Mips M/500 and its
floating-point co-processor, respectively. They are provided to give the reader a feel for the real
instruction set architecture, rather than the pseudo-instructions presented by the assembler
reorganizer. Please note that the nop and move instructions are really just special cases of the addu
instruction.


addi      addiu     addu      and       andi      b
bc0f      bc0t      bc1f      bc1t      bgez      bgezal
bgtz      blez      bltz      bne       break     c0
c2        c3        cfc1      ctc1      div       divu
j         jal       jalr      jr        lb        lbu
lh        lhu       li        lui       lw        lwc0
lwc2      lwc3      lwl       lwr       mfc0      mfc1
mfhi      mflo      move      mtc0      mtc1      mthi
mtlo      mult      multu     nop       nor       or
ori       sb        sh        slt       slti      sltiu
sltu      sra       srl       srlv      sub       subu
swc0      swc1      swc2      swc3      swl       swr
syscall   xor       xori

Table  A-3:  Actual  Mips  M/500  Instruction  Set 


abs.d 

abs.s 

add.d 

add.s 

c.eq.d 

c.eq.s 

c.f.d 

c.f.s 

c.le.d 

c.le.s 

c.lt.d 

c.lt.s 

c.nge.d 

c.nge.s 

c.ngl.d 

c.ngl.s 

c.ngle.d 

c.ngle.s 

c.ngt.d 

c.ngt.s 

c.ole.d 

c.ole.s 

c.olt.d 

c.olt.s 

c.seq.d 

c.seq.s 

c.sf.d 

c.sf.s 

c.ueq.d 

c.ueq.s 

c.ule.d 

c.ule.s 

c.ult.d 

c.ult.s

c.un.d 

c.un.s 

cvt.d.s 

cvt.d.w 

cvt.s.d 

cvt.s.w 

cvt.w.d 

cvt.w.s 

div.d 

div.s 

mov.d 

mov.s 

mul.d 

mul.s 

neg.d 

neg.s 

sub.d 

sub.s 

Table  A-4:  Mips  M/500  Floating-Point  Co-Processor  Instruction  Set 
Appendix B: Compiler and Assembler Version
Information

The following three tables list the version numbers of the compilers, assembler, and linker used to
generate all of the information in this report. The version information was obtained by running the
three compilers (C, FORTRAN, and Pascal) with the -v switch and no source file. The subcomponents
of the compilers and libraries are also listed, and are primarily from Berkeley release software.
The compiler components were created at Mips on January 29, 1987, and installed at the
Software Engineering Institute on March 20, 1986. All of the test results described in this document
were obtained after that installation date.


B.1.  C  Compiler 


C Compiler Components

Compiler Component     Version Number

/usr/lib/cpp           Mips Computer Systems Release 1.10c
/usr/lib/ccom          Mips Computer Systems Release 1.10g
/usr/lib/ujoin         Mips Computer Systems Release 1.10c
/usr/bin/uld           Mips Computer Systems Release 1.10h
/usr/lib/usplit        Mips Computer Systems Release 1.10c
/usr/lib/umerge        Mips Computer Systems Release 1.10b
ldopen.c               1.3  2/16/83
ldclose.c              1.3  2/16/83
vldldptr.c             1.1  1/8/82
allocldptr.c           1.2  2/16/83
freeldptr.c            1.1  1/7/82
/usr/lib/uopt          Mips Computer Systems Release 1.10e
/usr/lib/ugen          Mips Computer Systems Release 1.10j
ldopen.c               1.3  2/16/83
ldclose.c              1.3  2/16/83
vldldptr.c             1.1  1/8/82
allocldptr.c           1.2  2/16/83
freeldptr.c            1.1  1/7/82
/usr/lib/as0           Mips Computer Systems Release 1.10f
/usr/lib/as1           Mips Computer Systems Release 1.10f
/usr/lib/crt0.o        unknown
/usr/lib/libc.a        unknown
/usr/bin/ld            Mips Computer Systems Release 1.10h
                       Mips Computer Systems 1.10
Fortran-77 Compiler Components (contd.)

cbrt.c                 1.1  (Berkeley)  5/23/85
cabs.c                 1.2  (Berkeley)  8/21/85
log__L.c               1.2  (Berkeley)  8/21/85
log1p.c                1.3  (Berkeley)  8/21/85
exp__E.c               1.2  (Berkeley)  8/21/85
expm1.c                1.2  (Berkeley)  8/21/85
asinh.c                1.2  (Berkeley)  8/21/85
acosh.c                1.2  (Berkeley)  8/21/85
atanh.c                1.2  (Berkeley)  8/21/85
/usr/lib/libF77.a      Mips Computer Systems Release 1.10c
/usr/lib/libI77.a      Mips Computer Systems Release 1.10d
/usr/lib/libU77.a      unknown
/usr/bin/ld            Mips Computer Systems Release 1.10h
f77                    Mips Computer Systems 1.10

B.3.  Pascal  Compiler 


Pascal Compiler Components

Compiler Component     Version Number

/usr/lib/cpp           Mips Computer Systems Release 1.10c
/usr/lib/upas          Mips Computer Systems Release 1.10e
/usr/lib/ujoin         Mips Computer Systems Release 1.10c
/usr/bin/uld           Mips Computer Systems Release 1.10h
/usr/lib/usplit        Mips Computer Systems Release 1.10c
/usr/lib/umerge        Mips Computer Systems Release 1.10b
ldopen.c               1.3  2/16/83
ldclose.c              1.3  2/16/83
vldldptr.c             1.1  1/8/82
allocldptr.c           1.2  2/16/83
freeldptr.c            1.1  1/7/82
/usr/lib/uopt          Mips Computer Systems Release 1.10e
/usr/lib/ugen          Mips Computer Systems Release 1.10j
ldopen.c               1.3  2/16/83
ldclose.c              1.3  2/16/83
vldldptr.c             1.1  1/8/82
allocldptr.c           1.2  2/16/83
freeldptr.c            1.1  1/7/82
/usr/lib/as0           Mips Computer Systems Release 1.10f
/usr/lib/as1           Mips Computer Systems Release 1.10f
/usr/lib/crt0.o        unknown
/usr/lib/libc.a        unknown
/usr/lib/libp.a        Mips Computer Systems Release 1.10d
/usr/lib/libm.a        Mips Computer Systems Release 1.10b
pow.c                  4.5  (Berkeley)  8/21/85
support.c              1.1  (Berkeley)  5/23/85
cbrt.c                 1.1  (Berkeley)  5/23/85
cabs.c                 1.2  (Berkeley)  8/21/85
log__L.c               1.2  (Berkeley)  8/21/85
log1p.c                1.3  (Berkeley)  8/21/85
exp__E.c               1.2  (Berkeley)  8/21/85
expm1.c                1.2  (Berkeley)  8/21/85
asinh.c                1.2  (Berkeley)  8/21/85
acosh.c                1.2  (Berkeley)  8/21/85
atanh.c                1.2  (Berkeley)  8/21/85
/usr/bin/ld            Mips Computer Systems Release 1.10h
pc                     Mips Computer Systems 1.10
Appendix C: Conformance with CORE Instruction Set
Architecture

The key evidence that the Mips machine conforms to CORE ISA [CORE 87] is the existence of a
translator from the CORE assembler code to the Mips high-level assembler. However, the closeness
with which the Mips M/500 conforms can be established only by a feature analysis, which is
given in this appendix. In all cases, Mips refers to the true instruction set of the Mips M/500
machine, not to the high-level assembler. The latter is superficially closer to CORE, but we think it
appropriate to measure conformance in terms of what the machine actually executes.


C.1.  Registers

The  CORE  ISA  allows  the  machine  registers  to  be  represented  in  two  ways:  by  absolute  names  and 
by  logical  resource  names. 

C.1.1.  Absolute  Registers 

The CORE [Section 2.2.1] requires at least 16 integer registers (0..15) and 4 double-precision
floating-point registers (f0..f3). The Mips M/500 provides 27 free integer registers and 32
floating-point (or 16 double-precision floating-point) registers, and so conforms.

C.1.2.  Logical  Registers 

The  CORE  [Figure  2-3]  defines  sets  of  logical  registers  with  specific  functions.  The  Mips  assembler 
conventions  define  a  very  similar  set,  as  shown  in  the  following  table: 


CORE    Mips       Comment

.sp     sp         stack pointer
.fp     fp         frame pointer
.lr     ra         procedure return link
.fr     v0..v1     function result
.gx     v0..v1     expression evaluation
.aX     a0..a3     argument transmission
.tX     t0..t7     temporaries
.sX     s0..s7     locals (saved across calls)
.gp     gp         global pointer
.fX     f0..f31    floating-point registers
.z      r0         zero register

In  all  cases,  the  Mips  provides  at  least  the  minimum  required  number  of  each  resource  type. 


C.2.  Data  Types 

The CORE [Section 2.1] specifies byte, halfword, and word integer types, and single- and
double-precision floating types. The Mips M/500 provides all these, and in addition has unsigned byte and
halfword types.
The CORE [Section 2.1] requires natural alignment for all data types. Mips recommends observing
this requirement, but in fact permits double-word values to be aligned on word boundaries.
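The distinction between natural alignment and the weaker word alignment that Mips permits for double-word values can be illustrated with a small sketch. The function names and the example addresses below are hypothetical and chosen only for illustration; the 4-byte word and 8-byte double-word sizes follow the report.

```python
def is_naturally_aligned(address: int, size: int) -> bool:
    """True when the address is an exact multiple of the operand size in bytes."""
    return address % size == 0


def is_word_aligned(address: int) -> bool:
    """True when the address falls on a 4-byte word boundary."""
    return address % 4 == 0


# An 8-byte double at 0x1004 sits on a word boundary (acceptable to Mips)
# but is not naturally aligned in the CORE sense.
example_address = 0x1004
print(is_word_aligned(example_address), is_naturally_aligned(example_address, 8))
```

A double-word at 0x1008 would satisfy both conditions.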


C.2.1.  Integer  Operations 

The  CORE  [Section  2.2]  requires  both  overflowing  and  non-overflowing  operations.  The  Mips  M/500 
provides  both,  except  that  overflow  on  division  is  implemented  by  a  software  check.  This  is  a 
permissible  deviation. 

The  CORE  [Section  3.1]  requires  the  following  integer  operations: 

abs  add  div  mod  mul  neg  rem  sub

The  Mips  M/500  provides  them  in  the  following  manner: 

•  abs  is  implemented  by  a  conditional  branch  around  a  negate. 

•  div  and  rem  are  implemented  as  one  operation  yielding  both  quotient  and  remainder. 

•  neg  is  implemented  by  subtraction  from  zero. 

•  The  other  instructions  are  implemented  as  given  in  CORE. 
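The substitutions listed above can be modeled in a few lines. This is only a sketch under stated assumptions: the helper names are hypothetical, values are treated as 32-bit two's-complement words, and division is assumed to truncate toward zero with the remainder taking the sign of the dividend.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range (non-overflowing arithmetic)."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def neg32(value: int) -> int:
    """neg modeled as subtraction from zero."""
    return to_signed32(0 - value)


def abs32(value: int) -> int:
    """abs modeled as a conditional branch around a negate."""
    if value >= 0:          # branch taken: skip the negate
        return value
    return neg32(value)


def divrem32(dividend: int, divisor: int) -> tuple:
    """div and rem modeled as one operation yielding both results."""
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    remainder = dividend - quotient * divisor
    return to_signed32(quotient), to_signed32(remainder)
```

For example, `divrem32(-7, 2)` yields a quotient of -3 and a remainder of -1 under the truncation assumption.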

C.2.2.  Logical  Operations 

The CORE [Section 3.1.1] requires the following logical operations:
and  not  or  xor

On  the  Mips  M/500,  not  is  implemented  by  nor  with  zero,  with  the  other  instructions  as  in  the  CORE 
specification. 
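The equivalence between not and nor with zero is easy to check on 32-bit values. The function names below are hypothetical; the sketch models only the bit-level identity, not the assembler's actual expansion.

```python
MASK32 = 0xFFFFFFFF


def nor32(left: int, right: int) -> int:
    """Bitwise NOR of two 32-bit words."""
    return ~(left | right) & MASK32


def not32(value: int) -> int:
    """not modeled as nor with the zero register."""
    return nor32(value, 0)
```

Since `value | 0` is `value`, the NOR reduces to a plain complement.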

C.2.3.  Shift  Operations 

The  CORE  [Section  3.2]  defines  the  following  shift  operations: 

•  sll  (shift  left  logical) 

•  srl  (shift  right  logical) 

•  sra  (shift  right  arithmetic) 

•  rol (rotate left)

•  ror  (rotate  right) 

for single-word operands. The Mips M/500 implements sll, srl, and sra directly. It expands the
rotate instructions into three-instruction sequences, which is not unreasonable given that no common
high-level language can generate rotates. The CORE [Section 3.2.2] also requires the same operations
with double-word operands. The Mips assembler does not provide these operations; they must
be constructed out of the single-word forms.
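A rotate can be built from two shifts and an or, which is presumably the shape of the three-instruction sequence (the report does not list the exact instructions, so this is an assumption). The sketch below also shows one way a double-word left shift might be assembled from single-word shifts; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sll32(value: int, count: int) -> int:
    """Shift left logical on a 32-bit word."""
    return (value << count) & MASK32


def srl32(value: int, count: int) -> int:
    """Shift right logical on a 32-bit word."""
    return (value & MASK32) >> count


def rol32(value: int, count: int) -> int:
    """Rotate left: shift left, shift right by the complement, then or."""
    count %= 32
    if count == 0:
        return value & MASK32
    return sll32(value, count) | srl32(value, 32 - count)


def ror32(value: int, count: int) -> int:
    """Rotate right expressed as a rotate left by the complementary count."""
    return rol32(value, (32 - count) % 32)


def sll64(high: int, low: int, count: int) -> tuple:
    """Double-word shift left for 0 < count < 32, built from single-word shifts."""
    new_high = sll32(high, count) | srl32(low, 32 - count)
    new_low = sll32(low, count)
    return new_high, new_low
```

The double-word form needs extra work for counts of 32 or more, which this sketch deliberately leaves out.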


Note: Natural alignment means that the address of any variable of a type must be an exact multiple of the size of that type.
C.3.  Load and Store Operations

The CORE [Section 3.3] defines load, store, and load address instructions. The Mips M/500 provides
all these, and in addition two load immediate instructions (lui and li), which together allow
constants of up to 32 bits to be loaded from the instruction stream. The other operand of all load and
store instructions is a register in both CORE and the Mips M/500.
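The way a 32-bit constant is split across two 16-bit immediates can be sketched as follows. The sketch pairs lui with ori, as in the `lui at,0x20` / `ori at,at,0x1` expansion shown in Appendix A; the Python function names are hypothetical stand-ins for the instructions.

```python
def split_constant(value: int) -> tuple:
    """Split a 32-bit constant into its upper and lower 16-bit halves."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def lui(immediate: int) -> int:
    """Model of lui: place a 16-bit immediate in the upper half of a register."""
    return (immediate & 0xFFFF) << 16


def ori(register: int, immediate: int) -> int:
    """Model of ori: or a zero-extended 16-bit immediate into a register."""
    return register | (immediate & 0xFFFF)


upper, lower = split_constant(0x200001)
rebuilt = ori(lui(upper), lower)
```

Here `upper` is 0x20 and `lower` is 0x1, matching the operands in the Appendix A listing.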

C.3.1.  Addressing  Modes 

The CORE [Section 3.3.1] requires all addressing modes of the following form:
relocatable  +  absolute  (register) 

with  all  three  components  optional.  Mips  provides  exactly  these  modes,  but  requires  the  relocated 
offset  to  be  representable  as  a  signed  16-bit  quantity.  Many  static  addresses  must  therefore  be 
constructed  by  first  loading  the  upper  16  bits  into  a  temporary  register;  the  defects  of  this  process  are 
discussed  in  Chapter  7. 
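Because the 16-bit offset is signed, splitting a static address is slightly subtler than splitting a constant: when the low half has its top bit set, the upper half must be incremented to compensate. The sketch below illustrates the arithmetic only; the function names are hypothetical, and the report does not say whether the Mips assembler performs exactly this adjustment.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def split_address(address: int) -> tuple:
    """Return (upper 16 bits for the temporary register, signed 16-bit offset)."""
    low = sign_extend16(address)
    high = ((address - low) >> 16) & 0xFFFF
    return high, low


def effective_address(high: int, low: int) -> int:
    """Recombine: (high << 16) + signed offset, wrapped to 32 bits."""
    return ((high << 16) + low) & 0xFFFFFFFF
```

For an address such as 7592 (0x1DA8), the upper half is zero and the offset is the address itself, which matches the `at,0` and `a0,at,7592` operand pairs in Appendix A.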

The CORE [Section 3.1.1] also requires a register-to-register move, which is provided by the Mips
M/500 move, mov.s, and mov.d instructions.


C.4.  Control  Transfers 

C.4.1.  Branch  and  Jump  Instructions 

The CORE [Section 3.4.1] requires an unconditional branch and the full set of conditional branches.
The Mips M/500 does not provide this. Instead, it uses a combination of the "set" instructions and
the branch on zero/non-zero to construct all possible branch idioms. Defects of this process are
shown in Appendix A.
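The construction of a full set of conditional branches from a "set" instruction plus a branch on zero or non-zero can be sketched as below. The pairing of each idiom with slt and a following bne or beq is an assumption made for illustration, and the Python function names are hypothetical.

```python
def slt(left: int, right: int) -> int:
    """Model of set-on-less-than: 1 when left < right, otherwise 0."""
    return 1 if left < right else 0


def blt_taken(left: int, right: int) -> bool:
    """blt as: slt at,left,right followed by branch if at is non-zero."""
    return slt(left, right) != 0


def bge_taken(left: int, right: int) -> bool:
    """bge as: slt at,left,right followed by branch if at is zero."""
    return slt(left, right) == 0


def bgt_taken(left: int, right: int) -> bool:
    """bgt as: slt at,right,left (operands swapped), branch if non-zero."""
    return slt(right, left) != 0


def ble_taken(left: int, right: int) -> bool:
    """ble as: slt at,right,left (operands swapped), branch if zero."""
    return slt(right, left) == 0
```

Each two-operand comparison therefore costs two instructions plus the use of the temporary register.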

The  CORE  says  nothing  about  the  possible  range  of  a  branch.  The  Mips  M/500  provides  a  signed 
16-bit  word  offset,  which  should  be  enough  for  all  but  the  traditional  "pathological  cases." 
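The reach of a signed 16-bit word offset works out to roughly 128 Kbytes in either direction. The sketch below computes the range; the assumption that the offset is measured from the instruction following the branch (the delay slot) is labeled as such, and the function names are hypothetical.

```python
WORD_BYTES = 4


def branch_range_bytes(offset_bits: int = 16) -> tuple:
    """Smallest and largest byte displacement of a signed word offset."""
    lowest = -(1 << (offset_bits - 1)) * WORD_BYTES
    highest = ((1 << (offset_bits - 1)) - 1) * WORD_BYTES
    return lowest, highest


def can_reach(branch_address: int, target_address: int) -> bool:
    """Assumes the displacement is relative to the instruction after the branch."""
    displacement = target_address - (branch_address + WORD_BYTES)
    lowest, highest = branch_range_bytes()
    return displacement % WORD_BYTES == 0 and lowest <= displacement <= highest
```

With 16 bits the range is -131072 to +131068 bytes, so only unusually large procedures should fall outside it.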

The  CORE  [Section  3.4.2]  also  requires  a  general  jump  instruction  to  a  destination  whose  value  is 
held  in  a  register.  The  Mips  M/500  provides  exactly  this  instruction. 

C.4.2.  Call  Instruction 

The  CORE  [Section  3.4.3]  requires  a  call  instruction  of  the  following  form: 
cal  target,  link 

where  the  target  can  be  a  label  or  the  contents  of  a  register,  and  the  link  can  be  a  register  or  a  based 
address. 

The Mips M/500 provides three instructions, bal, jal, and jalr, according to whether the target is
a label, a general address, or a value in a register. In all cases, the return link is stored into ra.
However, the next instruction after the call is executed immediately, so that instruction should store
the link if necessary. The Mips M/500 also provides a conditional call instruction, bgezal.


CMU/SEI-87-TR-29 


205 


C.4.3.  Trap  Instruction 

The CORE [Section 3.4.4] requires a trap instruction that transfers control synchronously to an
exception handler with a status code in the range 0..255. The Mips M/500 provides a break instruction
with equivalent functionality.


C.5.  Floating-Point Instructions

The  CORE  [Section  3.5]  defines  a  set  of  floating-point  instructions.  The  Mips  M/500  defines  a  set  of 
general  co-processor  instructions,  which  in  the  special  case  of  a  floating-point  co-processor  become 
floating-point  instructions. 

C.5.1.  Floating-Point  Load  and  Store 

The  CORE  [Section  3.5.1]  defines  load-and-store  operations  for  both  floating  data  types  operating 
between  a  general  address  and  a  floating  register. 

Mips defines all these operations at the higher level. However, the double-precision load and store
expand into two single-precision loads and stores. This can create further problems with
addressability, as discussed in Section 7.1.
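The expansion of a double-precision transfer into two single-word transfers can be illustrated by splitting an IEEE double into its two 32-bit halves. The sketch also shows one plausible addressability consequence: both the offset and the offset plus 4 must fit in a signed 16-bit field. The function names are hypothetical, and the specific difficulty discussed in Section 7.1 may differ from the one shown here.

```python
import struct


def double_to_words(value: float) -> tuple:
    """Split an IEEE double into (most significant word, least significant word)."""
    high, low = struct.unpack(">II", struct.pack(">d", value))
    return high, low


def words_to_double(high: int, low: int) -> float:
    """Reassemble a double from its two 32-bit words."""
    return struct.unpack(">d", struct.pack(">II", high, low))[0]


def both_offsets_fit(offset: int) -> bool:
    """Both word transfers (offset and offset + 4) need a signed 16-bit offset."""
    return -32768 <= offset and offset + 4 <= 32767
```

An offset of 32764 is valid for a single-word load but leaves no room for the second word of a double.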

The  CORE  [Section  3.5.1]  also  defines  loads  and  stores  that  perform  various  conversions  and 
roundings.  The  Mips  M/500  provides  all  the  required  conversions,  but  only  with  register  operands; 
these  CORE  instructions  therefore  expand  into  a  load  and  a  conversion,  or  a  conversion  and  a  store. 
This  is  a  reasonable  simplification  (and  probably  improves  instruction  timing  predictability). 

C.5.2.  Floating  Operations 

The CORE [Section 3.5.2] requires the following full IEEE set of operations:

add  sub  mul  div  abs  sqrt

for  both  single  and  double  precision  operands.  Mips  provides  the  following: 

add  sub  mul  div  abs  neg 

It  does  not  provide  sqrt,  which  must  be  implemented  by  a  routine  call.  This  is  an  understandable 
simplification,  but  regrettable. 

The CORE [Section 3.5] requires only round to nearest to be provided. The Mips M/500 provides all
the IEEE rounding modes.

C.5.3.  Floating  Comparisons 

The  CORE  [Section  3.5.3]  requires  the  usual  six  conditional  branches  with  floating  or  double 
operands.  The  Mips  M/500  implements  them  all,  and  in  addition  provides  detailed  control  of  the 
action  to  be  taken  if  the  operands  are  unordered.  This  is  a  most  useful  extension. 
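The control over unordered operands can be illustrated with the ordered and unordered forms of the less-than test, named here after the c.olt and c.ult mnemonics listed in Appendix A. The Python functions are hypothetical models of the predicates, not of the co-processor's condition-code mechanism.

```python
import math


def is_unordered(left: float, right: float) -> bool:
    """Two operands are unordered when at least one is a NaN."""
    return math.isnan(left) or math.isnan(right)


def c_olt(left: float, right: float) -> bool:
    """Ordered less-than: false whenever the operands are unordered."""
    return (not is_unordered(left, right)) and left < right


def c_ult(left: float, right: float) -> bool:
    """Unordered-or-less-than: true whenever the operands are unordered."""
    return is_unordered(left, right) or left < right
```

The two predicates agree on ordinary numbers and differ only when a NaN is involved, which lets a compiler choose the behavior a language requires.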
C.5.4.  Floating Exceptions

The CORE [Section 3.5.4] requires that the following exceptions be recognized:

•  division  by  zero 

•  invalid  operation 

•  overflow 

•  underflow 

The  Mips  M/500  recognizes  and  handles  all  of  them.  It  also  recognizes,  and  can  trap  on,  invalid 
operands,  unordered  comparisons,  and  all  the  interesting  errors  associated  with  infinity. 

Overall, the Mips floating-point co-processor provides a creditable implementation of the IEEE
standard, which is both more than CORE requires and thoroughly commendable.


C.6.  Assembler  Directives 

The CORE [Appendix I] defines a set of assembler directives that a conforming translator should
support. Mips provides most of these, though with a Unix bias.

C.6.1.  Segments 

The CORE [Section I.2] requires the assembler to support named segments, of any of the types
(instruction, data, common) with any of the attributes (read_only, absolute, relocatable,
based_global).

Mips supports an extended set of Unix segments:

•  .text - instruction, read_only, relocatable

•  .rdata - data, read_only, relocatable

•  .sdata - data, relocatable, based_global

•  .data - data, relocatable

•  .sbss - common, relocatable, based_global

•  .bss - common, relocatable

Named common segments are generated by the .lcomm directive and allocated to the .bss or
.sbss regions depending on the size of the segment.

This is clearly an evolution of the Unix view of segmentation and is understandable for an assembler
intended exclusively for Unix-based code. However, it is inadequate for code running under other
regimes. In particular, the inability to define several based global areas, or to access read-only data
through a base pointer, is a serious handicap, as has been discussed in Sections 6.2.3 and 7.6.
C.6.2.  Data Directives

The CORE [Section I.5] requires the usual set of directives for generating initialized and uninitialized
static data space. Mips provides all of them, as follows:

CORE      Mips       Comment

align     .align     align next datum
ascii     .ascii     ascii string
          .asciz     zero-terminated ascii string
block     .space     reserve uninitialized space
byte      .byte      byte data
double    .double    double precision data
float     .float     single precision data
half      .half      halfword data
word      .word      word data

Mips  also  conforms  exactly  to  the  syntax  of  each  directive. 


C.7.  Local  Conclusions 

The  Mips  M/500  instruction  set  architecture  conforms  very  closely  to  the  CORE  ISA  standard.  The 
few  deviations  are  small  and  can  be  handled  by  simple  macro  substitution  or  peephole  translation. 
Most  of  them  are  justified  by  the  additional  simplicity  they  bring  (and  hence,  one  presumes,  by  cost 
or  performance  advantages). 

The high-level Mips assembler is even closer to CORE and can take on most of the burden of
handling the deviations. The minor problems inherent in this approach have been discussed
elsewhere, and they do not bear on the issue of conformance.

The Mips assembler directives are very close to those required by CORE, except for restrictions on
program segmentation that follow from a Unix bias. We have argued elsewhere that these
restrictions are undesirable.

Overall,  the  Mips  system  is  a  reasonable  and  accurate  realization  of  the  CORE  ISA. 
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